Univerzitet u Kragujevcu
Prirodno-Matematički fakultet
Institut za matematiku i informatiku
No description has been provided for this image

Life Expectancy (WHO)

Seminarski rad iz predmeta Uvod u nauku o podacima

Profesor: Branko Arsić, Ph.D.
Članovi tima:
Luka Mladićević 69/2022
Emilija Djordjević 49/2022

Uvod i opis problema¶

Očekivani životni vek predstavlja jedan od najvažnijih pokazatelja kvaliteta života i razvijenosti jednog društva. On ne odražava samo zdravstveno stanje populacije, već i nivo ekonomskog razvoja, obrazovanja, dostupnost medicinske zaštite, higijenske uslove, ishranu, političku stabilnost i mnoge druge faktore.

U savremenom svetu, zahvaljujući velikim količinama dostupnih podataka i napretku u oblasti mašinskog učenja, moguće je analizirati faktore koji utiču na očekivani životni vek i modelovati njihove međusobne odnose kroz različite implementacije.

U ovom radu primenjivaćemo metode analize podataka i regresione modele kako bi istražili koji faktori imaju najjači uticaj na očekivani životni vek i koliko precizno je moguće predvideti njegovu vrednost na osnovu dostupnih socio-ekonomskih i zdravstvenih pokazatelja. Korišćenjem tehnika poput regularizacije, selekcije atributa i evaluacije modela nad trening i test skupom, cilj je dobiti robustan i interpretabilan model koji ne samo da predviđa, već i objašnjava obrasce u podacima.

Opis problema¶

Iz perspektive mašinskog učenja, problem predikcije očekivanog životnog veka predstavlja zadatak regresije. Cilj je na osnovu poznatih karakteristika jedne zemlje - kao što su stopa smrtnosti odraslih (Adult Mortality), BDP, nivo obrazovanja, stopa imunizacije, zastupnost bolesti, potrošnja na zdravstvo i drugi indikatori, predvideti životni vek populacije.

Skup podataka obuhvata više zemalja kroz različite vremenske periode i sadrži kombinaciju numeričkih i kategorijskih promenljivih. Takva struktura podataka uvodi nekoliko izazova:

Visoka dimenzionalnost: veći broj potencijalnih prediktora može dovesti do prekomernog prilagođavanja (overfitting), zbog čega je neophodna pažljiva selekcija atributa.

Multikolinearnost: pojedini socio-ekonomski indikatori su međusobno snažno povezani, što može destabilizovati klasične regresione modele.

Različite skale i distribucije podataka: određene promenljive pokazuju izraženu asimetriju i prisustvo ekstremnih vrednosti, zbog čega je potrebna transformacija (npr. log transformacija).

Razlika između razvijenih i nerazvijenih zemalja: podaci pokazuju jasnu strukturnu podelu, što može uticati na interpretaciju modela i stabilnost koeficijenata.

Motivacija ovog rada nije samo izgradnja modela sa optimalnim metrikama, već razumevanje strukture podataka i identifikovanje faktora koji najviše doprinose dužem životnom veku. Analizom koeficijenata, značajnosti promenljivih i poređenjem različitih modela (uključujući regularizovane pristupe poput Lasso i Ridge regresije), dolazi se do uvida u to kako zdravstveni, ekonomski i društveni faktori oblikuju dugovečnost populacije.

Krajnji cilj projekta je konstruisati model koji može da objasni što veću varijabilnost očekivanog životnog veka.

Početna konfiguracija¶

In [2]:
!pip install pandas 
!pip install numpy 
!pip install seaborn 
!pip install scipy 
!pip install requests 
!pip install scikit-learn 
!pip install statsmodels 
!pip install matplotlib
!pip install xgboost
Requirement already satisfied: pandas in ./.venv/lib/python3.12/site-packages (2.3.3)
Requirement already satisfied: numpy>=1.26.0 in ./.venv/lib/python3.12/site-packages (from pandas) (2.4.1)
Requirement already satisfied: python-dateutil>=2.8.2 in ./.venv/lib/python3.12/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.12/site-packages (from pandas) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in ./.venv/lib/python3.12/site-packages (from pandas) (2025.3)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
Requirement already satisfied: numpy in ./.venv/lib/python3.12/site-packages (2.4.1)
Requirement already satisfied: seaborn in ./.venv/lib/python3.12/site-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in ./.venv/lib/python3.12/site-packages (from seaborn) (2.4.1)
Requirement already satisfied: pandas>=1.2 in ./.venv/lib/python3.12/site-packages (from seaborn) (2.3.3)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in ./.venv/lib/python3.12/site-packages (from seaborn) (3.10.8)
Requirement already satisfied: contourpy>=1.0.1 in ./.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.3.3)
Requirement already satisfied: cycler>=0.10 in ./.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in ./.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.61.1)
Requirement already satisfied: kiwisolver>=1.3.1 in ./.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.9)
Requirement already satisfied: packaging>=20.0 in ./.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (25.0)
Requirement already satisfied: pillow>=8 in ./.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (12.1.1)
Requirement already satisfied: pyparsing>=3 in ./.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.3.2)
Requirement already satisfied: python-dateutil>=2.7 in ./.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.12/site-packages (from pandas>=1.2->seaborn) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in ./.venv/lib/python3.12/site-packages (from pandas>=1.2->seaborn) (2025.3)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.12/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.17.0)
Requirement already satisfied: scipy in ./.venv/lib/python3.12/site-packages (1.17.0)
Requirement already satisfied: numpy<2.7,>=1.26.4 in ./.venv/lib/python3.12/site-packages (from scipy) (2.4.1)
Requirement already satisfied: requests in ./.venv/lib/python3.12/site-packages (2.32.5)
Requirement already satisfied: charset_normalizer<4,>=2 in ./.venv/lib/python3.12/site-packages (from requests) (3.4.4)
Requirement already satisfied: idna<4,>=2.5 in ./.venv/lib/python3.12/site-packages (from requests) (3.11)
Requirement already satisfied: urllib3<3,>=1.21.1 in ./.venv/lib/python3.12/site-packages (from requests) (2.6.3)
Requirement already satisfied: certifi>=2017.4.17 in ./.venv/lib/python3.12/site-packages (from requests) (2026.1.4)
Requirement already satisfied: scikit-learn in ./.venv/lib/python3.12/site-packages (1.8.0)
Requirement already satisfied: numpy>=1.24.1 in ./.venv/lib/python3.12/site-packages (from scikit-learn) (2.4.1)
Requirement already satisfied: scipy>=1.10.0 in ./.venv/lib/python3.12/site-packages (from scikit-learn) (1.17.0)
Requirement already satisfied: joblib>=1.3.0 in ./.venv/lib/python3.12/site-packages (from scikit-learn) (1.5.3)
Requirement already satisfied: threadpoolctl>=3.2.0 in ./.venv/lib/python3.12/site-packages (from scikit-learn) (3.6.0)
Requirement already satisfied: statsmodels in ./.venv/lib/python3.12/site-packages (0.14.6)
Requirement already satisfied: numpy<3,>=1.22.3 in ./.venv/lib/python3.12/site-packages (from statsmodels) (2.4.1)
Requirement already satisfied: scipy!=1.9.2,>=1.8 in ./.venv/lib/python3.12/site-packages (from statsmodels) (1.17.0)
Requirement already satisfied: pandas!=2.1.0,>=1.4 in ./.venv/lib/python3.12/site-packages (from statsmodels) (2.3.3)
Requirement already satisfied: patsy>=0.5.6 in ./.venv/lib/python3.12/site-packages (from statsmodels) (1.0.2)
Requirement already satisfied: packaging>=21.3 in ./.venv/lib/python3.12/site-packages (from statsmodels) (25.0)
Requirement already satisfied: python-dateutil>=2.8.2 in ./.venv/lib/python3.12/site-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.12/site-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in ./.venv/lib/python3.12/site-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2025.3)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas!=2.1.0,>=1.4->statsmodels) (1.17.0)
Requirement already satisfied: matplotlib in ./.venv/lib/python3.12/site-packages (3.10.8)
Requirement already satisfied: contourpy>=1.0.1 in ./.venv/lib/python3.12/site-packages (from matplotlib) (1.3.3)
Requirement already satisfied: cycler>=0.10 in ./.venv/lib/python3.12/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in ./.venv/lib/python3.12/site-packages (from matplotlib) (4.61.1)
Requirement already satisfied: kiwisolver>=1.3.1 in ./.venv/lib/python3.12/site-packages (from matplotlib) (1.4.9)
Requirement already satisfied: numpy>=1.23 in ./.venv/lib/python3.12/site-packages (from matplotlib) (2.4.1)
Requirement already satisfied: packaging>=20.0 in ./.venv/lib/python3.12/site-packages (from matplotlib) (25.0)
Requirement already satisfied: pillow>=8 in ./.venv/lib/python3.12/site-packages (from matplotlib) (12.1.1)
Requirement already satisfied: pyparsing>=3 in ./.venv/lib/python3.12/site-packages (from matplotlib) (3.3.2)
Requirement already satisfied: python-dateutil>=2.7 in ./.venv/lib/python3.12/site-packages (from matplotlib) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.12/site-packages (from python-dateutil>=2.7->matplotlib) (1.17.0)
Requirement already satisfied: xgboost in ./.venv/lib/python3.12/site-packages (3.2.0)
Requirement already satisfied: numpy in ./.venv/lib/python3.12/site-packages (from xgboost) (2.4.1)
Requirement already satisfied: nvidia-nccl-cu12 in ./.venv/lib/python3.12/site-packages (from xgboost) (2.29.3)
Requirement already satisfied: scipy in ./.venv/lib/python3.12/site-packages (from xgboost) (1.17.0)
In [98]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import requests
import math

from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error
from sklearn.metrics import roc_auc_score
from scipy.stats import shapiro
from scipy.stats import chi2_contingency
from scipy.stats import shapiro, ttest_ind, mannwhitneyu, f_oneway, kruskal, spearmanr
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

import matplotlib.pyplot as plt

Učitavanje podataka¶

In [82]:
dataframe = pd.read_csv("life_expectancy_data.csv")
dataframe.columns = dataframe.columns.str.strip()
dataframe.head()
Out[82]:
Country Year Status Life expectancy Adult Mortality infant deaths Alcohol percentage expenditure Hepatitis B Measles ... Polio Total expenditure Diphtheria HIV/AIDS GDP Population thinness 10-19 years thinness 5-9 years Income composition of resources Schooling
0 Afghanistan 2015 Developing 65.0 263.0 62 0.01 71.279624 65.0 1154 ... 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 Developing 59.9 271.0 64 0.01 73.523582 62.0 492 ... 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 Developing 59.9 268.0 66 0.01 73.219243 64.0 430 ... 62.0 8.13 64.0 0.1 631.744976 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 Developing 59.5 272.0 69 0.01 78.184215 67.0 2787 ... 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 Developing 59.2 275.0 71 0.01 7.097109 68.0 3013 ... 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5

5 rows × 22 columns

In [212]:
dataframe = dataframe.replace(" ",np.nan)
In [213]:
dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10  BMI                              2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio                            2919 non-null   float64
 13  Total expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15  HIV/AIDS                         2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18  thinness 10-19 years             2904 non-null   float64
 19  thinness 5-9 years               2904 non-null   float64
 20  Income composition of resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB

Exploratory Data Analaysis¶

Sada nakon što smo učitali podatke, možemo krenuti u razmatranje istih. Pred razmatranje postavljamo par pitanja "Koju promenljivu treba naš model da prediktuje?" "U kojoj zavisnosti je ciljana promenljiva sa ostalim promenljivama skupa podataka?" "Kako da opišemo te zavisnosti?" itd. Kako bismo odgovorili na sva ova pitanja, vodimo se primranom metodom za opisivanje podataka - Eksplorativna Analiza Podataka. Ideja ove metode je da putem grafičkih reprezentacija napravimo uvid u odnos ciljne promenljive sa svim ostalim nezavisnim promenljivama, ovo radimo kako bismo pronašli linearne zavisnosti, korelacije i osobine ostalih promenljivih koje mogu opisati ciljnu promenljivu odnosno, želimo da uočimo koje promenljive su relevantne i koje irelevantne za naš model.

Krenimo prvo od uopštenog opisa svih podataka dataset-a.

In [6]:
dataframe.describe(include="all").T
Out[6]:
count unique top freq mean std min 25% 50% 75% max
Country 2938 193 Afghanistan 16 NaN NaN NaN NaN NaN NaN NaN
Year 2938.0 NaN NaN NaN 2007.51872 4.613841 2000.0 2004.0 2008.0 2012.0 2015.0
Status 2938 2 Developing 2426 NaN NaN NaN NaN NaN NaN NaN
Life expectancy 2928.0 NaN NaN NaN 69.224932 9.523867 36.3 63.1 72.1 75.7 89.0
Adult Mortality 2928.0 NaN NaN NaN 164.796448 124.292079 1.0 74.0 144.0 228.0 723.0
infant deaths 2938.0 NaN NaN NaN 30.303948 117.926501 0.0 0.0 3.0 22.0 1800.0
Alcohol 2744.0 NaN NaN NaN 4.602861 4.052413 0.01 0.8775 3.755 7.7025 17.87
percentage expenditure 2938.0 NaN NaN NaN 738.251295 1987.914858 0.0 4.685343 64.912906 441.534144 19479.91161
Hepatitis B 2385.0 NaN NaN NaN 80.940461 25.070016 1.0 77.0 92.0 97.0 99.0
Measles 2938.0 NaN NaN NaN 2419.59224 11467.272489 0.0 0.0 17.0 360.25 212183.0
BMI 2904.0 NaN NaN NaN 38.321247 20.044034 1.0 19.3 43.5 56.2 87.3
under-five deaths 2938.0 NaN NaN NaN 42.035739 160.445548 0.0 0.0 4.0 28.0 2500.0
Polio 2919.0 NaN NaN NaN 82.550188 23.428046 3.0 78.0 93.0 97.0 99.0
Total expenditure 2712.0 NaN NaN NaN 5.93819 2.49832 0.37 4.26 5.755 7.4925 17.6
Diphtheria 2919.0 NaN NaN NaN 82.324084 23.716912 2.0 78.0 93.0 97.0 99.0
HIV/AIDS 2938.0 NaN NaN NaN 1.742103 5.077785 0.1 0.1 0.1 0.8 50.6
GDP 2490.0 NaN NaN NaN 7483.158469 14270.169342 1.68135 463.935626 1766.947595 5910.806335 119172.7418
Population 2286.0 NaN NaN NaN 12753375.120052 61012096.508428 34.0 195793.25 1386542.0 7420359.0 1293859294.0
thinness 10-19 years 2904.0 NaN NaN NaN 4.839704 4.420195 0.1 1.6 3.3 7.2 27.7
thinness 5-9 years 2904.0 NaN NaN NaN 4.870317 4.508882 0.1 1.5 3.3 7.2 28.6
Income composition of resources 2771.0 NaN NaN NaN 0.627551 0.210904 0.0 0.493 0.677 0.779 0.948
Schooling 2775.0 NaN NaN NaN 11.992793 3.35892 0.0 10.1 12.3 14.3 20.7

Za dosta opisnih polja uočavamo da vrednost nije broj, odnosno da vrednost nedostaje što nam na prvi pogled daje naznaku da će ovaj set podataka biti problematičan za čišćenje. Kod ostalih podataka možemo uglavnom videti manje više očekivane raspodele. Na prvi pogled za promenljivu BMI vidimo da ima jako čudne vrednosti, mean = 38 na svetskom nivou bi ukazivalo na to da smo verovatno napokon prevazišli glad u Africi. Percentage expenditure takodje ima nelogičnu srednju vrednost koja prelazi 100%

Dalje procene ćemo svakako izvršiti pošto ćemo svaku promenljivu posmatrati zasebno.

Life Expectancy¶

Promenljiva Life expectancy predstavlja prosečan broj godina koje se očekuje da će novorođena osoba živeti.

Ova promenljiva je jedan od najvažnijih pokazatelja ukupnog nivoa razvoja jedne zemlje, jer indirektno odražava kvalitet zdravstvenog sistema, životni standard, nivo obrazovanja, pristup čistoj vodi i sanitaciji, ishranu, bezbednost, kao i socio-ekonomske uslove. Veće vrednosti ove promenljive ukazuje da je država za koju vršimo predvidjanje stabilna, razvijena, dosta ulaže u zdravstveni sistem, ima visok BDP po glavi stanovnika, ne postoje zarazne bolesti koje haraju tom državom i slično. Sa druge strane, niže vrednosti često su povezane sa siromaštvom, zaraznim bolestima, političkom nestabilnošću i slabom zdravstvenom infrastrukturom. Ideja je da napravimo model koji će predvidjati vrednosti za ovu promenljivu na osnovu ostalih socio-ekonomskih faktora (promenljivih) kako bi objasnili njen nivo.

In [7]:
dataframe[["Life expectancy"]].describe().T.join(
    pd.DataFrame({
    "median" : [dataframe["Life expectancy"].median()]
    },index=["Life expectancy"])
)
Out[7]:
count mean std min 25% 50% 75% max median
Life expectancy 2928.0 69.224932 9.523867 36.3 63.1 72.1 75.7 89.0 72.1

Vidimo da na osnovu podataka kojima bratamo u globalu, osobe žive ≈ 69 godina. Posmatrajmo sada distribuciju ciljne promenljive, u zavisnosti od potrebe, možemo i transformisati ciljnu promenljivu u slučaju da je njena raspodela Right Skewed logaritamskom transformacijom.

In [8]:
plt.figure(figsize=(8, 5))
plt.hist(dataframe["Life expectancy"], bins=30,edgecolor="black",linewidth=1)
plt.xlabel("Life expectancy (godine)")
plt.ylabel("Godina starosti")
plt.title("Distribucija Life Expectancy")
plt.show()
No description has been provided for this image

Vidimo da je raspodela "Life Expectancy" promenljive blago Left Skewed što nam generalno naznačava da je skroz okej da je zadržimo takvu kakva je, odnosno nije nam potreban bilo kakva transformacija nad promenljivom posebno zato što ne možemo videti ni izrazite outliere na grafiku.

Sada možemo krenuti u razmatranje nezavisnih promenljivih.

COUNTRY¶

Promenljiva Country predstavlja državu. Sagledajmo sada od koliko jedinstvenih država se naš dataset sastoji.

In [9]:
uniques = dataframe["Country"].nunique()

print("Broj jedinstvenih država u datasetu :",uniques)
Broj jedinstvenih država u datasetu : 193

Pošto je broj jedinistvenih država velik, posmatraćemo vrednosti samo za 15 država.

In [287]:
top_countries = dataframe["Country"].value_counts().head(15).index

dataframe[dataframe["Country"].isin(top_countries)].boxplot(
    column="Life expectancy",
    by="Country",
    figsize=(10, 6),
    rot=45
)

plt.title("Life Expectancy grupisan po Country")
plt.suptitle("")
plt.grid(False)
plt.show()
No description has been provided for this image

Ovim grafikom vidimo odnos Life Expectancy za svaku državu, odnosno podatke koliko je životno očekivanje za svaku zabeleženu godinu po državi. Na grafiku možemo uočiti i par vrednosti van "whiskers-a" što naznačava outlier vrednosti. U svakom slučaju promenljiva Country se ne čini kao pouzdani prediktor pošto ne postoji dovoljan broj zabeleženih godina za svaku državu, dodatno što je broj unikatnih država poprilično velik što može biti problem pri enkodiranju ove promenljive što bi proizvelo popriličnu kompleksnost modela.

YEAR¶

Promenljiva Year predstavlja godinu zapisa svih faktora jedne države. Kada bi postojalo dovoljno ovakvih zapisa mogli bismo i predvidjati Očekivani životni vek nacija za narednu godinu u poredjenju sa podacima prošlih godina.

Svakako prvo ćemo iscrtati boxplot grafik za Year i Life Expectancy.

In [288]:
years = dataframe["Year"].value_counts().index

dataframe[dataframe["Year"].isin(years)].boxplot(
    column="Life expectancy",
    by="Year",
    figsize=(10, 6),
    rot=45
)

plt.title("Life Expectancy grupisan po Year")
plt.suptitle("")
plt.grid(False)
plt.show()
No description has been provided for this image

Sa grafika očigledno vidimo da imamo zapise za samo 16 godina, što nije dovoljno da za svaku državu predvidjamo životni vek zasebno, posebno bi bilo teško sprovesti ovo običnom linearnom regresijom. Posmatranjem 2005. vidimo da postoji više outlier-a, oni mogu biti naznaka nekog rata, epidemije, ili katastrofe u kojem je preminuo veći broj država nego uobičajeno.

Status¶

In [289]:
dataframe["Status"].unique()
Out[289]:
array(['Developing', 'Developed'], dtype=object)

Promenljiva Status je kategorijska promenljiva i ima dve vrednosti "Developing" i "Developed". Na osnovu domenskog znanja, znamo da sve države koje su Razvijene ("Developed") imaju veći BDP po glavi stanovnika, bolje uslove za život, bolji zdravstveni sistem i pobudjenu svest o bitnosti zdravlja, u tom smislu ova promenljiva postavlja čistu granicu socio-ekonosmkih i razvojnih osobina država. S toga ćemo sve dalje grafike predstavljati koristeći i ovu kategorijsku promenljivu.

In [13]:
status = dataframe["Status"].value_counts().index

dataframe[dataframe["Status"].isin(status)].boxplot(
    column="Life expectancy",
    by="Status",
    figsize=(10, 6),
    rot=45
)

plt.title("Life Expectancy grupisan po Status")
plt.suptitle("")
plt.grid(False)
plt.show()
No description has been provided for this image

Grafik dokazuje da je naša pretpostavka na osnovu domenskog znanja tačna, te da su sve vrednosti boxplota za razvijene države uže grupisane oko gornjih vrednosti Life expectency-a sa većom prosečnom vrednošću u odnosu na nerazvijene države, s toga vidimo da Status zaista čini jak kategorijski razgranitelj za ciljnu promenljivu.

Adult Mortality¶

In [292]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Adult Mortality"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("Adult Mortality")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Adult Mortality by Status")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva Adult Mortality predstavlja broj smrtnih slučajeva na 1000 stanovnika. Odavde vidimo negativnu ali poprilično jaku povezanost Adult Mortality-a i Life Expectancy-a (što je Adult Mortality veci to je manji Life expectancy) Uz to da nam outlier-i (donji desni podaci), ukazuju na trend koji je mozda izazvan epidemijom, ratovima, katastrofe itd. Jasno možemo razgraničiti da "Developed" države se grupišu oko levog gornjeg ugla grafika što je očekivano i dodatno podstiče značajnost "Status" promenljive. Iako se ova promenljiva čini kao dobar prediktor, ne smemo je koristiti u predikciji jer ona predstavlja "Data leakage", odnosno Adult Moratilty direktno opisuje Life Expectancy (Adult Mortality je praktično sadržan u promenljivoj Life expectancy) čime bi mogli da dostignemo nerealno visoke performanse modela ali time ne bi ostvarili prave prediktivne vrednosti u praksi.

infant deaths¶

In [34]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["infant deaths"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("infant deaths")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs infant deaths by Status")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva infant deaths pokazuje apsolutan broj infant deaths na 1000 stanovnika, pa trend od preko 1000 infant death sigurno predstavlja data error što je veoma smisleno s obzirom da znamo da je dosta podataka ovog dataseta scrapeovano sa interneta i dolaze sa različitih izvora. Kao i u prethodnim razmatranjima, vidimo da promenljiva Status dobro razgraničava očekivani životni vek. Možemo smatrati da je ova promenljiva jako ozbiljan indikator u odredjivanju životnog veka jedne populacije s obzirom da se za države sa velikim brojem smrti novorodjenčadi odlikuje jako loš zdravstveni sistem kao i svest o brizi novorodjene dece. Zaključivši ovo, ustanovićemo da sve države za koje infant deaths premašuje 200 ima jako mali Life expectancy što se očigledno i vidi sa grafika. Kako bismo pravilno posmatrali raspodelu ove promenljive, postavićemo plafon vrednosti za infant deaths na 150 pri razmatranju.

In [35]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["infant deaths"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlim(0,150)
plt.xlabel("infant deaths")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs infant deaths by Status")
plt.legend()
plt.show()
No description has been provided for this image

Posmatrajući ovako limitiran grafik, vidimo da je promenljiva infant deaths očigledno right skewed što nam pruža mogućnost da odradimo logaritamsku transformaciju nad podacima. Takodje jedna od opcija bi bila da razdvojimo ovu promenljivu na tri kategorije low , medium , high. Od posebnog značaja nam je transformacija nad ovom promenljivom kako bismo ublažili efekat outliera.

ALCOHOL¶

In [33]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Alcohol"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("Alcohol")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Alcohol by Status")
plt.legend()
plt.show()
No description has been provided for this image
In [17]:
filtered_df_alcohol = (
    dataframe.loc[dataframe["Alcohol"] >= 15,
                  ["Country","Alcohol"]]
    .sort_values(by="Alcohol", ascending=False)
)
filtered_df_alcohol
Out[17]:
Country Alcohol
874 Estonia 17.87
228 Belarus 17.31
873 Estonia 16.99
875 Estonia 16.58
227 Belarus 16.35
876 Estonia 15.52
1523 Lithuania 15.19
1525 Lithuania 15.14
877 Estonia 15.07
872 Estonia 15.04
1524 Lithuania 15.04

Promenljiva Alcohol predstavlja konzumaciju alkohola na nivou glavnih gradova zabeleženih država. Posmatrajući grafik ne vidimo jaku linearnu povezanost alkohola i Life expectancy-a, povezanost bi se mogla posmatrati u vidu logaritamske funkcije zbog desne asimetrije što nam naznačava da i ova promenljiva dolazi u obzir za logaritamsku transformaciju. Osmatrajući i države koje su imale konzumaciju sa više od 15 litara po glavi stanovnika, ovi podaci ne deluju kao outlieri pošto su ovo države istočne Evrope poznate po velikom konzumiranju alkohola. Imamo i zemlje koje imaju veliku konzumaciju alkohola ali su pak razvijene, imaju dobru medicinu itd pa zbog toga zadrzavaju solidan life expectancy, sto nam ukazuje da je alkohol jasno povezan sa razvojem države, očekujemo da gradjani razvijene države imaju veću svest o načinu na koji konzumiraju alkohol (manje količine ali češće, pojedini i na dnevnom nivou). Promenljiva svakako dolazi u obzir pri razmatranju Life expectancy promenljive s obzirom da se u paru sa promenljivom Status jasno vidi efekat na ciljnu promenljivu.

PERCENTAGE EXPENDITURE¶

In [32]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["percentage expenditure"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("percentage expenditure")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs percentage expenditure by Status")
plt.legend()
plt.show()
No description has been provided for this image
In [31]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["percentage expenditure"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlim(0,2500)
plt.xlabel("percentage expenditure")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs percentage expenditure by Status (<=2500)")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva Percentage expenditure predstavlja potrošnju na zdravstvo po glavi stanovnika, slutimo da je vrlo moguća multikolinearnost sa GDP s toga je jako bitno da pri feature selection-u proverimo VIF metrikom korelacije. Jasno vidimo stub sa leve strane, koji ima raspodelu od minimuma do maksimuma za life expectancy, sto znaci da i drugi faktori jasno uticu na life expectancy ali ujedno da potrošnja za vrednosti do 2500 veoma jako utiče na life expectancy, dok otprilike preko 2500 dolazi do zasićenja, i ne vidimo rast u life expectancy-u. Outlier-i nam ovde prerdstavlju life expectancy za koje je visok Percentage expenditure a Life expecntacy ima vrednosti <50 pošto je poprilično ispod prosečnog očekivanog životnog veka na globalnom nivou. Oni ne moraju nužno biti uklonjeni pošto možda ukazuju na realne situacije (rat, epidemija...) Posmatranjem raspodele takodje vidimo da i ova promenljiva može biti pogodna za logaritamsku transformaciju, ali u slučaju da ona nije multikolinearna.

Hepatitis B¶

In [47]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Hepatitis B"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("Hepatitis B")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Hepatitis B (%) by Status")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva Hepatitis B opisuje imunizaciju medju 1-godisnjom decom u procentima. Postoji direktna povezanost sa life expectancy-om ali veza nije linerna (dosta tačaka sa visokom imunizacijom i Life expectancy-em), poprilicno je raspršena, može se upotrebiti kao kategorijska promenljiva ili je možemo spojiti sa ostalim promenljivama koje opisuju imunizacije neke bolesti stvorivši imunološki indeks. Takodje imamo jasne high leverage point-ove (0-15%,95-100%), gde vrednosti 0-15% očekujemo da odlikuju siromašne države dok za države koje poseduju ove vrednosti ali da su pritom razvijene smatramo da predstavljaju informativne outliere, gde odredjeni primeri imaju mali Life expectancy iako imaju jak % imunizacije, što ukazuje na uticaj drugih faktora.

MEASLES¶

In [46]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Measles"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("Measles")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Measles by Status")
plt.legend()
plt.show()
No description has been provided for this image
In [22]:
filtered_df_measles = (
    dataframe.loc[dataframe["Measles"] >= 100000,
                  ["Country","Measles"]]
    .sort_values(by="Measles", ascending=False)
)
filtered_df_measles
Out[22]:
Country Measles
1908 Nigeria 212183
731 Democratic Republic of the Congo 182485
1907 Nigeria 168107
1905 Nigeria 141258
725 Democratic Republic of the Congo 133802
567 China 131441
570 China 124219
1575 Malawi 118712
1903 Nigeria 110927
568 China 109023

Promenljiva Measles predstavlja broj prijavljenih slučajeva malih boginja na 1000 stanovnika. Dosta podataka za Measles pivotira oko 0, što je normalan indikator pošto većina drzava nema prijavljen veliki broj slučajeva malih boginja, očigledno se ne može uočiti direktna linearna veza izmedju slučajeva malih boginja i life expectancy-a. Ekstremni slučajevi (>100 000) ukazuju na epidemije malih boginja, ovi high leverage podaci su opravdano veliki za te države i godine ali ne mogu ukazati na znatno bolji životni standard koji samim tim utiče na Life expectancy jer epidemije malih boginja mogu biti prisutne u većini delova sveta.

BMI¶

In [45]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["BMI"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("BMI")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs BMI")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva BMI predstavlja indeks telesne mase i koristi se kako bi opisala gojaznost osoba. BMI možemo izračunati tako što podelimo težinu osobe u kilogramima sa kvadratom visine te osobe. Može se uočiti solidna linearna veza BMI i Life expectancy ali je očigledno da dobar deo ovih podataka predstavlja data errore pošto se za dosta država odlikuje da njihove populacije imaju prosečan BMI od preko 40 što je nerealno s obzirom da države poput Nauru, Američke Samoe, Tokelau koje se smatraju za države sa najvećom vrednošću BMI imaju prosečan BMI od ~34. Ovakvi podaci na nivou države totalno nemaju smisla. Najbolja odluka za ovaj feature bi bio dropping celog feature-a.

under-five deaths¶

In [44]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["under-five deaths"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("under-five deaths")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs under-five deaths by Status")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva under-five deaths predstavlja broj preminule dece uzrasta manjeg od 5 godina. Pošto već imamo promenljivu koja posmatra broj preminule novorodjenčadi, posmatrajući raspodele ove dve promenljive, zaključujemo da iziskuju praktično identične podatke, s toga nam je za odabir prediktora modela svejedno koju ćemo od te dve promenljive odabrati.

POLIO¶

In [43]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Polio"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("Polio")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Polio deaths by Status")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva Polio predstavlja procentualni broj vakcinisanih 1-godišnjaka. Na osnovu raspodele možemo doći do praktično istih zapažanja kao za promenljivu Hepatitis B. Pošto je ova promenljiva na istoj skali kao i promenljiva Hepatitis B možemo je kombinovati kako bismo napravili imunizacioni index države.

TOTAL EXPENDITURE¶

In [42]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Total expenditure"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("Total expenditure")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Total Expenditure by Status")
plt.legend()
plt.show()
No description has been provided for this image

Total expenditure predstavlja ukupnu potrošnju države na zdravstvo u procentima. Total expenditure ima jako rasutu distribuciju, i sam po sebi je vrlo loš feature, ima prisutne i high leverage pointove koji ne uticu na Life expectancy. Generalno rečeno, ova promenljiva nema nikakvu prediktivnu moć za rešavanje problema.

DIPTHTHERIA¶

In [50]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Diphtheria"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("Diphtheria")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Diphtheria by Status")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva Diphtheria predstavlja procentualni broj vakcinisanih 1-godišnjaka. Dolazimo do istih zaključaka kao i za ostale imunološke promenljive (Hepatitis B i Polio).

HIV/AIDS¶

In [51]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["HIV/AIDS"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("HIV/AIDS")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs HIV/AIDS by Status")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva HIV/AIDS predstavlja broj umrle dece od ove bolesti uzrasta 0-4 godine. Posmatrajući grafik odmah uočavamo solidnu negativnu korelaciju u odnosu na ciljnu promenljivu, takodje posmatrajući razvijene države možemo uočiti da razvijene države u potpunosti nemaju niti slute na mogućnost epidemije HIV-a što naznačava da je HIV u potpunosti karakteristika razvijenosti zdravstvenog sistema jedne države. Takodje jasno možemo videti da za sve države koje imaju više od 1% polako ali sigurno očekivani životni vek opada. Ovakva zapažanja direktno pokazuju koliko veliku rolu u proceni očekivanog životnog veka mogu imati bolesti, pošto su one najčešće i reprezentativni faktor zdravstvenog sistema jedne države.

GDP¶

In [ ]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["GDP"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("GDP")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs GDP by Status")
plt.legend()
plt.show()
No description has been provided for this image

GDP (BDP - Bruto domaći proizvod) predstavlja ukupno stvoren domaći dohodak jedne države. Pošto odmah uočavamo jako desno asimetrčnost podataka, radi boljeg razmatranja odmah iscrtavamo ovaj grafik na logaritamskoj skali.

In [53]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["GDP"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xscale("log")
plt.xlabel("GDP")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs GDP by Status")
plt.legend()
plt.show()
No description has been provided for this image

Na grafiku je prisutan klaster u gornjem desnom uglu koji odlikuju razvijene države, što jasno naznačava povezanost sa promenljivom Life expectancy. Iako su podaci za nerazvijene države rasuti svuda po grafiku, uočljiva je pozitivna korelacija s toga u ovoj promenljivoj leži potencijalna predikstorska moć. U svakom slučaju preko grafika je odlikovano da razvijene države imaju veći iznos GDP-a, što najčešće naznačava i posvećenost i brizi stanovništva države kroz njen zdravstveni sistem, s toga možemo reći da iako je GDP ekonomski aspekt jedne države, on se zasigurno indirektno odražava i na medicinski aspekt države. Pored toga možemo smatrati da se povećanjem GDP-a povećava i kvalitet infrastrukture jedne države (ekološki pristup, čist vazduh, sanitacija).

In [31]:
filtered_df_GDP = (
    dataframe.loc[dataframe["GDP"] >= 60000,
                  ["Country","GDP"]]
    .sort_values(by="GDP", ascending=False)
)
filtered_df_GDP
Out[31]:
Country GDP
1539 Luxembourg 119172.74180
1542 Luxembourg 115761.57700
1545 Luxembourg 114293.84330
1540 Luxembourg 113751.85000
1547 Luxembourg 89739.71170
2074 Qatar 88564.82298
2525 Switzerland 87998.44468
1915 Norway 87646.75346
2072 Qatar 86852.71190
2075 Qatar 85948.74600
2522 Switzerland 85814.58857
1918 Norway 85128.65759
2523 Switzerland 84658.88768
2524 Switzerland 83164.38795
2078 Qatar 82967.37228
1549 Luxembourg 75716.35180
2526 Switzerland 74276.71842
1919 Norway 74114.69715
2528 Switzerland 72119.56870
2527 Switzerland 69672.47100
1178 Iceland 68348.31817
114 Australia 67792.33860
115 Australia 67677.63477
1920 Norway 66775.39440
2071 Qatar 66346.52267
1550 Luxembourg 65445.88530
744 Denmark 64322.66640
2529 Switzerland 63223.46778
738 Denmark 62425.53920
116 Australia 62245.12900
113 Australia 62214.69120
741 Denmark 61753.66700
2077 Qatar 61478.23813
1258 Ireland 61388.17457
1257 Ireland 61235.41500
739 Denmark 61191.19263

Posmatranjem GDP-a koji je veći od 60 000, vidimo da ove tačke iako jesu influental points, ne predstavljaju netačne podatke, pošto je GDP za Luxemburg i stvarno toliko visok. U moru ovih niskih podataka za GDP smo sigurni da postoje data error-i u levom stubu, ali je prirodno da za većinu država bude < 15000 . Dolazimo do zaključka da će GDP uz Status igrati veliku ulogu u prediktivnom modelu.

POPULATION¶

In [54]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Population"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("Population")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Population by Status")
plt.legend()
plt.show()
No description has been provided for this image
In [58]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Population"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlim(0,40000000)
plt.xlabel("Population")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Population by Status")
plt.legend()
plt.show()
No description has been provided for this image
In [34]:
filtered_df_population = (
    dataframe.loc[dataframe["Population"] >= 1000000000, 
                  ["Country", "Population"]]
    .sort_values(by="Population", ascending=False)
)
filtered_df_population
Out[34]:
Country Population
1187 India 1.293859e+09
1194 India 1.179681e+09
1195 India 1.161978e+09
1196 India 1.144119e+09
1197 India 1.126136e+09
In [35]:
filtered_df_china = dataframe.loc[
    dataframe["Country"] == "China",
    ["Country", "Population"]
]
filtered_df_china
Out[35]:
Country Population
560 China 137122.0
561 China 136427.0
562 China 135738.0
563 China 135695.0
564 China 134413.0
565 China 133775.0
566 China 133126.0
567 China 1324655.0
568 China 1317885.0
569 China 13112.0
570 China 13372.0
571 China 129675.0
572 China 12884.0
573 China 1284.0
574 China 127185.0
575 China 1262645.0

Promenljiva Population predstavlja broj stanovnika jedne države. Population očigledno nema linearne povezanosti sa Life expectancy s toga nećemo preći u šire razmatranje ove promenljive. Podaci od preko 1 milijarde su ocekivani za državu poput Indije, ali i za državu poput Kine, sto je dodatna nelogičnost, ako posmatramo podatke za Kinu, vidimo da su očigledno netacni. Mimo toga, ne možemo zaključiti nikakvu korelisanost sa Life expectancy.

thinness 1-19 years¶

In [ ]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["thinness 10-19 years"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("thinness 1-19 years")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs thinness 1-19 years by Status")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva thinness 1-19 years opisuje učestalost mršavosti medju decom i adolescentima izmedju 10 i 19 godina u procentima (greška u imenovanju kolone pošto nije 1-19 već 10-19) što nam označava BMI koji je ispod referentnih vrednosti, odnosno nedostatak nutritivnih vrednosti u ishrani dece. Može se uociti umerena negativna linearna povezanost sa Life expectancy, sve klastere koji formiraju liniju možemo videti kao entry-je za zasebne drzave, koje prate odredjeni trend neuhranjenosti. Svakako je pristuno da je Life expectancy visok za vrednosti koje su blizu 0, ali vertikalni stub koji se javlja svuda naznacava uticaj drugih socio-ekonomskih faktora koji utiču na očekivani životni vek populacije. Ujedno uočavamo da je raspodela jako slična sa infant deaths i under-five deaths. Takodje je jako uočljiv klaster koji formiraju razvijene države s toga ponovno daju potporu značajnosti Status-a.

THINNESS 5-9 YEARS

In [60]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["thinness 5-9 years"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("thinness 5-9 years")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs thinness 5-9 years by Status")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva thinness 5-9 years opisuje isti pojam kao i thinness 1-19 years (odnosno 10-19) samo je sada posmatran opseg dece starosti 5-9 godina. Prirodno je da uporedimo grafik ove promenljive sa grafikom pomenute promenljive gde dolazimo do zaključka da su raspodele ove dve promenljive praktično identične, s toga je dovoljno da uzmemo bilo koju od ove dve promenljive kao prediktor našeg modela. Posebno je važno da ne odaberemo obe promenljive za naš model kako bismo izbegli multikolinearnost.

INCOME COMPOSITION OF RESOURCES¶

In [61]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Income composition of resources"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("Income composition of resources")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Income composition of resources by Status")
plt.legend()
plt.show()
No description has been provided for this image
In [39]:
df_filtered_icr = (
    dataframe.loc[dataframe["Income composition of resources"] < 0.1,
                  ["Country","Income composition of resources"]]
    .sort_values(by="Income composition of resources", ascending=False)
    
)
df_filtered_icr
Out[39]:
Country Income composition of resources
74 Antigua and Barbuda 0.0
2422 South Sudan 0.0
2420 South Sudan 0.0
2419 South Sudan 0.0
2418 South Sudan 0.0
... ... ...
860 Eritrea 0.0
849 Equatorial Guinea 0.0
607 Comoros 0.0
606 Comoros 0.0
2857 Vanuatu 0.0

130 rows × 2 columns

Promenljiva Income composition of resources opisuje razvoj zasnovan na prihodima po stanovniku, koji je normalizovan izmedju 0 i 1. Za data entry-e gde je ova vrednost = 0.0 na osnovu domenskog znanja, dolazimo do zaključka da ove vrednosti opisuju jako slabo razvijene države koje su u potpunoj stagnaciji i ne postoji nikakva naznaka progresa koja bi samim tim mogla da navede i na povećanje očekivanog životnog veka. Mimo toga, vidimo jasnu i jaku pozitivnu povezanost ove promenljive sa Life expectancty-jem, gde high leverage pointovi dostižu čak i ~ 90 godina i to posebno za razvijene države. Ova promenljiva deluje kao siguran kandidat za feature selection.

SCHOOLING¶

In [62]:
plt.figure(figsize=(10, 8))

for status in dataframe["Status"].unique():
    subset = dataframe[dataframe["Status"] == status]
    plt.scatter(
        subset["Schooling"],
        subset["Life expectancy"],
        alpha=0.4,
        label=status
    )

plt.xlabel("Schooling")
plt.ylabel("Life expectancy")
plt.title("Life Expectancy vs Schooling by Status")
plt.legend()
plt.show()
No description has been provided for this image

Promenljiva Schooling predstavlja prosek godina školovanja jedne države. U poredjenju sa Income composition of resources raspodele su praktično identične, i sagledanjem obe promenljive dolazimo do zaključka da su one ozbiljan kandidat za multikolinearnost, pošto nivo školovanja direktno utiče na svest gradjana jedne države a samim tim i na to u šta treba ulagati novac, odupiranje korupciji i slično. Uočavamo prisutnost granice od 10 godina, iznad koje je očekivani životni vek jako visok, poduprene time da su većina takvih država razvijene. S toga ćemo posmatrati samo Schooling pošto iako u suštini opisuju različite pojmove, one su usko povezane.

CISCENJE PODATAKA¶

Prvo proveravamo da li ima duplikata, nemamo duplikate u dataset-u.

In [15]:
dataframe.duplicated().any()
Out[15]:
np.False_

U tabeli ispod prikazan je procenat nedostajućih vrednosti za svaki feature u datasetu

In [16]:
(dataframe.isnull().sum()/dataframe.shape[0]*100).round(2)
Out[16]:
Country                             0.00
Year                                0.00
Status                              0.00
Life expectancy                     0.34
Adult Mortality                     0.34
infant deaths                       0.00
Alcohol                             6.60
percentage expenditure              0.00
Hepatitis B                        18.82
Measles                             0.00
BMI                                 1.16
under-five deaths                   0.00
Polio                               0.65
Total expenditure                   7.69
Diphtheria                          0.65
HIV/AIDS                            0.00
GDP                                15.25
Population                         22.19
thinness 10-19 years                1.16
thinness 5-9 years                  1.16
Income composition of resources     5.68
Schooling                           5.55
dtype: float64

Ovde je i graficki prikazano:

In [17]:
df = (dataframe.isna().mean()*100).round(2)
df = df[df > 0].sort_values()

df.plot(kind="barh", figsize=(8,5), title="Nedostajuće vrednosti (%)")
plt.xlabel("%")
plt.tight_layout()
plt.show()
No description has been provided for this image
In [19]:
missing_data = dataframe.columns[dataframe.isna().any()]

miss = dataframe[missing_data].isna().astype(int)

corr = miss.corr()

results = []
for i in corr.columns:
    for j in corr.columns:
        if i < j:
            r = corr.loc[i, j]
            if r > 0.3:
                results.append((i, j, r))

results.sort(key=lambda x: x[2], reverse=True)

print("Korelacija nedostajucih vrednosti")
print("-" * 70)
for i, j, r in results:
    print(f"{i:32}    {j:32}  r={r:.3f}")
Korelacija nedostajucih vrednosti
----------------------------------------------------------------------
Adult Mortality                     Life expectancy                   r=1.000
BMI                                 thinness 10-19 years              r=1.000
BMI                                 thinness 5-9 years                r=1.000
Diphtheria                          Polio                             r=1.000
thinness 10-19 years                thinness 5-9 years                r=1.000
Income composition of resources     Schooling                         r=0.987
Alcohol                             Total expenditure                 r=0.895
GDP                                 Population                        r=0.744
GDP                                 Schooling                         r=0.559
GDP                                 Income composition of resources   r=0.554
Income composition of resources     Population                        r=0.456
Population                          Schooling                         r=0.454
BMI                                 Polio                             r=0.428
BMI                                 Diphtheria                        r=0.428
Polio                               thinness 10-19 years              r=0.428
Polio                               thinness 5-9 years                r=0.428
Diphtheria                          thinness 10-19 years              r=0.428
Diphtheria                          thinness 5-9 years                r=0.428

Korelacija nedostajućih vrednosti¶

Ova tabela prikazuje korelaciju između nedostajućih vrednosti. U suštini, pokazuje koliko često dva feature-a nemaju podatke u istim redovima.

Iz rezultata se vidi da nedostajuće vrednosti često dolaze u grupama.

  • BMI, thinness 10–19 years i thinness 5–9 years imaju korelaciju r = 1.000. To znači da kada nedostaje jedan od ovih podataka, nedostaju i ostali. MOzemo zakljuciti d potiču iz istog izvora.

  • Slično važi za Adult Mortality i Life expectancy, kao i za Diphtheria i Polio, gde nedostajanje podataka takođe potpuno poklapa. To ukazuje da su ti podaci verovatno preuzeti iz istih izvora.

  • Postoji i jaka korelacija između Income composition of resources i Schooling (r = 0.987), što su socio-ekonomski indikatori. Moguće je da ovi podaci nedostaju za iste zemlje ili godine.

  • Parovi poput GDP ↔ Population (r = 0.744) i Alcohol ↔ Total expenditure (r = 0.895) pokazuju da ekonomske i finansijske metrike često nedostaju zajedno.

Na osnovu ovoga može se zaključiti da nedostajuće vrednosti u datasetu nisu nasumične, već se pojavljuju u grupama povezanih varijabli.

In [20]:
fig, axis = plt.subplots(figsize=(9,7))

heatmap = axis.imshow(corr, cmap="RdYlBu_r", vmin=-1, vmax=1)

axis.set_xticks(range(len(corr.columns)))
axis.set_yticks(range(len(corr.columns)))

axis.set_xticklabels(corr.columns, rotation=45, ha="right")
axis.set_yticklabels(corr.columns)

for i in range(len(corr)):
    for j in range(len(corr)):
        axis.text(j, i, round(corr.iloc[i, j], 2),
                ha="center", va="center", fontsize=7)

plt.colorbar(heatmap)

axis.set_title("Korelacija nedostajućih vrednosti")

plt.show()
No description has been provided for this image

Heatmap vizuelno prikazuje korelaciju nedostajućih vrednosti između feature-a. Tamnije boje (bliže 1) označavaju da dve kolone često nedostaju u istim redovima, dok svetlije boje označavaju slabiju povezanost nedostajanja.

Na heatmapi se jasno uočavaju iste grupe koje smo videli u tabeli, kao što su BMI i thinness varijable, kao i Diphtheria i Polio, koje imaju gotovo identičan obrazac nedostajanja. Ovo potvrđuje da određeni skupovi podataka nedostaju zajedno, verovatno zato što potiču iz istih izvora.

POPULATION

In [21]:
dataframe["Population"].describe()
Out[21]:
count    2.286000e+03
mean     1.275338e+07
std      6.101210e+07
min      3.400000e+01
25%      1.957932e+05
50%      1.386542e+06
75%      7.420359e+06
max      1.293859e+09
Name: Population, dtype: float64

Osnovna statistika – Population¶

Feature Population ima veoma veliki raspon vrednosti. Minimalna vrednost iznosi 34, dok maksimalna dostiže 1.29 milijardi, što pokazuje da dataset obuhvata i veoma male države, ali i najnaseljenije zemlje sveta.

Medijana populacije iznosi oko 1.38 miliona, dok je prosečna vrednost znatno veća (12.7 miliona). Ova razlika ukazuje na jaku desnu asimetriju raspodele, jer nekoliko veoma velikih država značajno povećava prosečnu vrednost.

Takođe, standardna devijacija je veoma visoka (≈61 milion), što dodatno potvrđuje veliku varijabilnost populacije između različitih zemalja u datasetu.

In [22]:
plt.hist(dataframe["Population"].dropna(), bins=40, edgecolor="black",linewidth=1)
plt.title("Distribucija populacije")

plt.show()
No description has been provided for this image

Histogram prikazuje raspodelu vrednosti populacije u datasetu. Na x-osi su opsezi populacije (u milijardama, zbog velike skale), dok y-osa prikazuje koliko zapisa (country–year kombinacija) spada u taj opseg.

Grafik pokazuje izrazitu desnu asimetriju. Većina država ima relativno malu populaciju, dok mali broj veoma velikih država (npr. Kina i Indija) značajno povećava opseg vrednosti i stvara dugačak rep na desnoj strani raspodele.

In [23]:
pop = pd.to_numeric(dataframe["Population"], errors="coerce")

print(pop.describe())
print("Missing %:", pop.isna().mean()*100)

plt.figure(figsize=(6,4))
plt.hist(np.log10(pop.dropna()), bins=40, edgecolor="black",linewidth=1)
plt.title("Population distribution (log10 scale)")
plt.xlabel("log10(Population)")
plt.ylabel("Count")
plt.show()
count    2.286000e+03
mean     1.275338e+07
std      6.101210e+07
min      3.400000e+01
25%      1.957932e+05
50%      1.386542e+06
75%      7.420359e+06
max      1.293859e+09
Name: Population, dtype: float64
Missing %: 22.19196732471069
No description has been provided for this image

Pošto populacija ima veoma veliki raspon vrednosti (od nekoliko desetina do više od milijardu), običan histogram je teško čitljiv jer nekoliko veoma velikih država dominira skalom.

Zato se koristi log10 transformacija. Ona “sabija” velike vrednosti i širi male, pa raspodela postaje preglednija. Na taj način lakše vidimo kako su zemlje raspoređene po veličini populacije, bez da ekstremno velike države potpuno razvuku grafikon.

In [25]:
bad = (pop <= 0)
print("Broj redova sa vrednoscu 0:", bad.sum())
Broj redova sa vrednoscu 0: 0
In [26]:
miss_by_year = dataframe.groupby("Year")["Population"].apply(lambda s: s.isna().mean())

plt.figure(figsize=(7,3))
plt.plot(miss_by_year.index, miss_by_year.values, marker="o")
plt.title("Population missing rate by year")
plt.xlabel("Year")
plt.ylabel("Missing rate")
plt.show()
No description has been provided for this image
In [27]:
miss_by_country = dataframe.groupby("Country")["Population"].apply(lambda s: s.isna().mean()).sort_values(ascending=False)

plt.figure(figsize=(8,4))
miss_by_country.head(20).plot(kind="bar",  edgecolor="black", linewidth=1)
plt.title("Raspodela procenta nedostajućih vrednosti populacije po državama")
plt.ylabel("Procenat nedostajućih vrednosti")
plt.show()

full_missing = miss_by_country[miss_by_country == 1.0].index.tolist()
print("Broj država kojima populacija potpuno nedostaje:", len(full_missing))
print("Prvih 30 država:", full_missing[:30])
No description has been provided for this image
Broj država kojima populacija potpuno nedostaje: 48
Prvih 30 država: ['Antigua and Barbuda', 'Dominica', 'Barbados', 'Bahrain', 'Bahamas', 'Brunei Darussalam', 'Bolivia (Plurinational State of)', 'Gambia', 'Egypt', 'Democratic Republic of the Congo', 'Cuba', "Côte d'Ivoire", "Democratic People's Republic of Korea", 'Congo', 'Czechia', 'Cook Islands', 'The former Yugoslav republic of Macedonia', 'United States of America', 'United Republic of Tanzania', 'United Kingdom of Great Britain and Northern Ireland', 'Marshall Islands', 'Niue', 'Oman', 'Nauru', 'New Zealand', 'Micronesia (Federated States of)', 'Monaco', 'Kyrgyzstan', 'Kuwait', 'Libya']

Za neke države Population nedostaje u 100% redova. U tim slučajevima ne možemo da radimo interpolaciju, jer ne postoji nijedna poznata vrednost kroz godine. Takođe nema smisla popunjavati mean/median iz drugih država, jer populacija jedne države nema veze sa populacijom druge i takva imputacija bi bila proizvoljna.

Najverovatnije je problem u nazivima država pri spajanju podataka (npr. različite verzije imena kao “Czechia” vs “Czech Republic”, “Bolivia (Plurinational State of)” itd.), pa se vrednosti nisu poklopile. Zbog toga ćemo Population popuniti korišćenjem drugog dataset-a sa populacijom i spojiti ga sa ovim podacima.

In [28]:
df = dataframe.sort_values(["Country","Year"]).copy()
df["Population"] = pd.to_numeric(df["Population"])

prev = df.groupby("Country")["Population"].shift(1)
df["Population growth"] = (df["Population"] - prev) / prev

extreme = df["Population growth"].abs().sort_values(ascending=False).head(20)
print(df.loc[extreme.index, ["Country","Year","Population","Population growth"]].to_string(index=False))
               Country  Year  Population  Population growth
               Hungary  2011   9971727.0       81069.951220
              Ethiopia  2008  83184892.0       10206.987729
                  Iraq  2015  36115649.0       10121.098935
              Maldives  2015     49163.0        1198.097561
                 Benin  2005   7982225.0        1028.433196
              Cameroon  2009  19432541.0        1022.950943
               Burundi  2001   6555829.0        1011.326899
          Turkmenistan  2003   4655741.0        1008.484172
                  Peru  2010  29373646.0        1006.430325
             Nicaragua  2002   5171734.0         998.368889
Bosnia and Herzegovina  2008   3763599.0         996.244038
                  Mali  2013  16477818.0         987.649307
               Uruguay  2014   3419546.0         980.218364
              Pakistan  2005  15399667.0         974.712285
  Syrian Arab Republic  2015  18734987.0         972.802537
                Turkey  2009  71339185.0         957.447778
            Tajikistan  2007   7152385.0         945.458251
               Germany  2015  81686611.0         908.397284
                Bhutan  2009    714458.0         897.689308
                  Chad  2006   1421597.0         845.692674

Ovo računa godišnji rast populacije po državama u odnosu na prethodnu godinu: (pop - prethodna) / prethodna.

U izlazu se pojavljuju ekstremne vrednosti (npr. 1000x, 10000x...), što nije realno za promenu populacije u jednoj godini. Najverovatnije znači da je prethodna vrednost bila pogrešno upisana ili da nedostaje podatak za tu godinu, pa račun daje ogroman skok. Zbog toga ove redove posmatramo kao potencijalne greške u podacima i ne uzimamo ih zdravo za gotovo bez dodatne provere.

In [29]:
g = df[df["Country"] == "Hungary"][["Year","Population"]].sort_values("Year")
print(g.to_string(index=False))

g2 = df[df["Country"] == "Hungary"][["Year","Population","Population growth"]].sort_values("Year")
print(g2.to_string(index=False))
 Year  Population
 2000    121971.0
 2001   1187576.0
 2002    115868.0
 2003   1129552.0
 2004    117146.0
 2005     18765.0
 2006     17137.0
 2007     15578.0
 2008    138188.0
 2009     12265.0
 2010       123.0
 2011   9971727.0
 2012    992362.0
 2013    989382.0
 2014   9866468.0
 2015    984328.0
 Year  Population  Population growth
 2000    121971.0                NaN
 2001   1187576.0           8.736544
 2002    115868.0          -0.902433
 2003   1129552.0           8.748610
 2004    117146.0          -0.896290
 2005     18765.0          -0.839815
 2006     17137.0          -0.086757
 2007     15578.0          -0.090973
 2008    138188.0           7.870715
 2009     12265.0          -0.911244
 2010       123.0          -0.989971
 2011   9971727.0       81069.951220
 2012    992362.0          -0.900482
 2013    989382.0          -0.003003
 2014   9866468.0           8.972354
 2015    984328.0          -0.900235
In [134]:
df = dataframe.sort_values(["Country","Year"]).copy()
df["Population"] = pd.to_numeric(df["Population"], errors="coerce")

c = "Hungary"
g = df[df["Country"]==c][["Year","Population"]].sort_values("Year")

plt.figure(figsize=(9,4))
plt.plot(g["Year"], g["Population"], marker="o")
plt.title("Hungary — Population by Year (corrupted scale jumps)")
plt.xlabel("Year")
plt.ylabel("Population")
plt.grid(True, alpha=0.3)
plt.show()
No description has been provided for this image
In [46]:
for state in ["Hungary","Luxembourg","Maldives","Germany","India"]:
    g = dataframe[dataframe["Country"] == state].sort_values("Year")

    y = g["Population"]
    plt.figure(figsize=(8,3))
    plt.plot(g["Year"], y, marker="o")
    plt.title(f"{state} — Population over time")
    plt.xlabel("Year")
    plt.ylabel("Population")
    plt.grid(True, alpha=0.3)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Ovi grafici izgledaju loše za populaciju: vide se ogromni skokovi i padovi skoro na nulu u jednoj godini, što nema smisla za realnu populaciju (populacija ne može da ima takve oscilacije). To nam govori da su podaci pogrešni ili loše popunjeni (npr. neke godine su 0, pa posle dođe prava vrednost i izgleda kao ekstreman rast tj. pad).

In [32]:
df = dataframe.sort_values(["Country","Year"]).copy()
df["Population"] = pd.to_numeric(df["Population"])
prev = df.groupby("Country")["Population"].shift(1)
df["pop_growth"] = (df["Population"] - prev) / prev

top = df[df["pop_growth"].notna()].copy()
top["abs_growth"] = top["pop_growth"].abs()

print(top.sort_values("abs_growth", ascending=False)[["Country","Year","Population","pop_growth"]].to_string(index=False))
                 Country  Year   Population   pop_growth
                 Hungary  2011 9.971727e+06 81069.951220
                Ethiopia  2008 8.318489e+07 10206.987729
                    Iraq  2015 3.611565e+07 10121.098935
                Maldives  2015 4.916300e+04  1198.097561
                   Benin  2005 7.982225e+06  1028.433196
                Cameroon  2009 1.943254e+07  1022.950943
                 Burundi  2001 6.555829e+06  1011.326899
            Turkmenistan  2003 4.655741e+06  1008.484172
                    Peru  2010 2.937365e+07  1006.430325
               Nicaragua  2002 5.171734e+06   998.368889
  Bosnia and Herzegovina  2008 3.763599e+06   996.244038
                    Mali  2013 1.647782e+07   987.649307
                 Uruguay  2014 3.419546e+06   980.218364
                Pakistan  2005 1.539967e+07   974.712285
    Syrian Arab Republic  2015 1.873499e+07   972.802537
                  Turkey  2009 7.133918e+07   957.447778
              Tajikistan  2007 7.152385e+06   945.458251
                 Germany  2015 8.168661e+07   908.397284
                  Bhutan  2009 7.144580e+05   897.689308
                    Chad  2006 1.421597e+06   845.692674
                 Armenia  2005 2.981259e+06   824.376246
                 Romania  2013 1.998369e+07   772.512406
              Mozambique  2006 2.154746e+07   735.992954
                 Senegal  2004 1.955944e+06   115.432169
                  Cyprus  2005 1.276580e+05   110.882559
                   India  2001 1.714779e+08   110.645625
                 Morocco  2006 3.869346e+06   108.871539
             South Sudan  2008 9.263136e+06   103.587842
             Philippines  2015 1.171636e+07   103.378293
                 Myanmar  2010 5.155896e+06   102.388799
                   Niger  2003 1.265687e+06   102.220274
             South Sudan  2001 6.974442e+06   102.086822
                 Senegal  2014 1.454611e+07   101.994442
             Afghanistan  2015 3.373649e+07   101.986410
                  Zambia  2011 1.426476e+07   101.970094
            Burkina Faso  2003 1.265462e+07   101.940845
                 Nigeria  2015 1.811817e+08   101.672790
                  Zambia  2003 1.142198e+07   101.670442
                   Ghana  2006 2.211342e+07   101.648320
                   Sudan  2002 2.867956e+07   101.626774
                Paraguay  2001 5.466240e+05   101.613854
                 Senegal  2012 1.373513e+06   101.569860
                 Nigeria  2001 1.254634e+08   101.542264
                 Comoros  2014 7.593850e+05   101.412003
   Sao Tome and Principe  2012 1.828890e+05   101.286913
                   Kenya  2015 4.723626e+07   101.149017
              Azerbaijan  2009 8.947243e+06   101.097850
               Guatemala  2004 1.279692e+07   100.985408
                   Sudan  2007 3.228253e+07   100.913494
                  Israel  2003 6.689700e+04   100.821918
               Australia  2012 2.272825e+07   100.727003
                   Haiti  2001 8.692567e+06   100.676964
Central African Republic  2008 4.345386e+06   100.627438
                Cambodia  2015 1.551764e+07   100.569162
              Bangladesh  2004 1.413749e+07   100.568988
                Cambodia  2006 1.347449e+07   100.524921
                 Algeria  2008 3.486715e+06   100.428758
              Kazakhstan  2012 1.679142e+07   100.418317
                Zimbabwe  2006 1.312427e+07   100.398935
                  Greece  2004 1.955141e+06   100.370923
              Uzbekistan  2001 2.496445e+06   100.259228
              Uzbekistan  2006 2.648825e+06   100.227691
              Bangladesh  2007 1.471392e+08   100.218140
                    Peru  2006 2.794994e+07   100.216205
                  Jordan  2001 5.193482e+06   100.211818
              Luxembourg  2001 4.415250e+05   100.197570
                  Turkey  2015 7.827147e+07   100.174559
                  Bhutan  2012 7.529670e+05   100.055831
                Cambodia  2011 1.453789e+07   100.045957
            South Africa  2005 4.766672e+05   100.031846
                 Myanmar  2003 4.762489e+07   100.023911
                Paraguay  2005 5.795494e+06   100.012549
                  Canada  2007 3.288793e+07    99.959089
              Cabo Verde  2014 5.264370e+05    99.927339
                   Kenya  2008 3.914842e+07    99.742452
             Switzerland  2004 7.389625e+06    99.688436
                Zimbabwe  2003 1.263390e+07    99.648452
              Kazakhstan  2007 1.548419e+07    99.622495
               Mauritius  2005 1.228254e+06    99.569393
                  Brazil  2003 1.824821e+08    99.534427
                   China  2007 1.317885e+06    99.509838
     Trinidad and Tobago  2011 1.334788e+06    99.503577
             El Salvador  2003 5.971535e+06    99.475073
                Thailand  2012 6.784398e+07    99.463013
             Netherlands  2015 1.693992e+07    99.439487
             El Salvador  2009 6.137276e+06    99.395479
              Azerbaijan  2005 8.391850e+05    99.320980
                  Greece  2001 1.862132e+06    99.179255
              Costa Rica  2012 4.654122e+06    99.144640
Central African Republic  2010 4.448525e+06    99.140130
                  Poland  2008 3.812576e+07    99.000417
                 Croatia  2008 4.434580e+05    98.967989
               Mauritius  2011 1.252440e+05    98.875598
              Costa Rica  2014 4.757575e+06    98.863038
                  Poland  2004 3.818222e+07    98.834026
        Papua New Guinea  2004 6.161517e+06    98.823683
                 Belgium  2015 1.127420e+07    98.809627
                 Estonia  2009 1.334515e+06    98.746992
               Nicaragua  2005 5.379328e+06    98.667019
                  Greece  2010 1.112134e+07    98.549227
                 Ukraine  2014 4.527195e+07    98.521532
                   Italy  2003 5.731323e+06    98.507318
      Russian Federation  2003 1.446483e+08    98.507209
                  Serbia  2010 7.291436e+06    98.491533
                    Iraq  2006 2.769791e+07    98.480336
                 Myanmar  2005 4.848261e+07    98.476615
                  Guinea  2012 1.128147e+07    98.381317
                   Japan  2011 1.278330e+05    98.326340
                Portugal  2006 1.522288e+06    98.281810
      Russian Federation  2005 1.435185e+08    98.200364
                Pakistan  2011 1.741843e+08    98.183493
                 Albania  2008 2.947314e+06    98.179392
  Bosnia and Herzegovina  2015 3.535961e+06    98.152067
                 Ukraine  2003 4.781295e+06    98.145568
                Zimbabwe  2014 1.541168e+07    98.138502
                Bulgaria  2007 7.545338e+06    98.121647
               Swaziland  2003 1.873920e+05    97.992076
            Burkina Faso  2012 1.657122e+07    97.524418
                 Albania  2013 2.895920e+05    97.467188
                Honduras  2010 8.194778e+06    97.116378
             Netherlands  2002 1.614893e+07    97.099412
                  Malawi  2013 1.657715e+07    96.664872
                Ethiopia  2012 9.244418e+07    96.643092
                  Latvia  2008 2.177322e+06    96.528421
    Syrian Arab Republic  2003 1.741527e+07    96.405720
                 Armenia  2009 2.888584e+06    95.860841
                 Ecuador  2003 1.328961e+06    95.820705
                 Albania  2003 3.396160e+05    95.729137
              Kazakhstan  2010 1.632158e+07    95.422784
                  Mexico  2013 1.225360e+08    94.519516
                   India  2004 1.126136e+09    94.210538
     Trinidad and Tobago  2008 1.315372e+06    93.454402
                  Mexico  2007 1.118363e+08    92.792695
             El Salvador  2005 6.289610e+05    91.835572
                  Sweden  2007 9.148920e+05    91.835312
                 Georgia  2009 3.978000e+03    91.511628
                Djibouti  2008 8.229340e+05    91.030195
              Mauritania  2015 4.182341e+06    89.152203
              Costa Rica  2003 4.125971e+06    88.067676
                   Benin  2014 1.286712e+06    88.039651
                 Liberia  2012 4.181563e+06    87.654419
                  Uganda  2008 3.166390e+07    87.080782
                 Ireland  2005 4.159914e+06    87.018154
                Bulgaria  2002 7.837161e+06    86.917716
              Cabo Verde  2012 5.139790e+05    86.605079
                Suriname  2008 5.151480e+05    85.217238
               Argentina  2010 4.122389e+07    84.892586
                   Spain  2002 4.143156e+07    84.353386
                Colombia  2002 4.157249e+07    82.328471
    Syrian Arab Republic  2013 1.989141e+06    80.955461
                 Vanuatu  2006 2.146340e+05    72.079333
                    Chad  2008 1.113386e+07    61.698425
                    Mali  2001 1.129326e+07    56.393482
                  Greece  2013 1.965211e+06    16.161766
                  Rwanda  2012 1.788853e+06    10.794298
                 Eritrea  2006 4.666480e+05    10.755246
                  Panama  2001 3.896840e+05    10.685729
                Malaysia  2015 3.723155e+06    10.533330
                  Bhutan  2002 6.639900e+04    10.261703
                  Cyprus  2007 1.637120e+05    10.244728
                   Haiti  2013 1.431776e+06    10.105840
      Dominican Republic  2013 1.281296e+06    10.093952
                 Belgium  2008 1.799730e+05    10.070493
                  Guinea  2015 1.291533e+06     9.893589
                  Mexico  2004 1.699558e+07     9.863594
                Slovenia  2006 2.686800e+04     9.860146
              Kazakhstan  2009 1.692710e+05     9.799477
                 Myanmar  2012 5.986514e+06     9.780083
                  Brazil  2014 2.421313e+07     9.767939
                 Lesotho  2011 2.641660e+05     9.759888
                  Mexico  2012 1.282837e+06     9.697708
                 Lebanon  2007 4.864660e+05     9.636624
                 Tunisia  2006 1.196136e+06     9.634021
                  Jordan  2011 7.574943e+06     9.546549
                  Mexico  2002 1.435568e+06     9.496373
      Dominican Republic  2015 1.528394e+06     9.479649
       Equatorial Guinea  2003 6.946110e+05     9.422240
                 Lebanon  2004 3.863267e+06     9.400111
                 Lebanon  2015 5.851479e+06     9.388243
                   Niger  2007 1.466834e+07     9.379050
              Montenegro  2001 6.738900e+04     9.375520
                   Niger  2005 1.361845e+07     9.374285
                   Tonga  2010 1.413700e+04     9.364370
                  Angola  2009 2.254955e+07     9.363120
               Indonesia  2003 2.254521e+07     9.361523
             South Sudan  2004 7.787655e+06     9.360857
                 Belgium  2002 1.332785e+06     9.359211
                   Kenya  2007 3.885990e+05     9.355736
                  Uganda  2004 2.756844e+07     9.354412
                   Gabon  2012 1.756817e+06     9.351816
                 Liberia  2006 3.375838e+06     9.351426
                  Angola  2004 1.886572e+07     9.346625
             South Sudan  2006 8.468152e+06     9.341177
                    Mali  2008 1.413822e+07     9.338233
                    Chad  2011 1.228865e+07     9.337700
                   Gabon  2011 1.697110e+05     9.334998
                    Iraq  2012 3.277657e+07     9.330641
                 Tunisia  2009 1.521834e+06     9.329003
                    Mali  2005 1.279876e+07     9.328280
                  Uganda  2013 3.755373e+07     9.326047
                  Angola  2001 1.698327e+07     9.324651
              Seychelles  2002 8.372300e+04     9.308175
                 Eritrea  2001 3.497124e+06     9.307456
            Burkina Faso  2008 1.468973e+07     9.306981
              Madagascar  2002 1.676512e+07     9.304744
                  Malawi  2008 1.427123e+07     9.304371
            Burkina Faso  2006 1.382918e+07     9.303419
                  Malawi  2011 1.562762e+07     9.303052
                    Chad  2013 1.313359e+07     9.299764
                  Jordan  2015 9.159320e+05     9.298777
             South Sudan  2015 1.188214e+07     9.296737
                   Malta  2007 4.672400e+04     9.296166
              Mozambique  2012 2.567666e+06     9.295579
              Mozambique  2013 2.643437e+07     9.295098
              Madagascar  2007 1.943352e+07     9.291719
             Timor-Leste  2004 9.966980e+05     9.290939
              Mauritania  2006 3.226530e+05     9.284744
                   Benin  2010 9.199259e+06     9.284523
                 Senegal  2010 1.291623e+07     9.284301
              Madagascar  2005 1.833672e+07     9.284215
                 Burundi  2014 9.891790e+05     9.284023
              Madagascar  2011 2.174395e+07     9.280030
             Afghanistan  2007 2.661679e+07     9.279353
                   Kenya  2003 3.413852e+06     9.278074
                   Benin  2012 9.729160e+05     9.275617
                   Kenya  2001 3.232148e+07     9.275523
                   Kenya  2011 4.248684e+07     9.274553
                    Togo  2002 5.251472e+06     9.273295
              Mauritania  2002 2.873228e+06     9.271470
                 Nigeria  2012 1.672973e+08     9.271340
                Cameroon  2006 1.789956e+07     9.270607
         Solomon Islands  2001 4.238530e+05     9.270494
                 Nigeria  2010 1.585783e+08     9.269162
                Cameroon  2015 2.283452e+07     9.267349
              Mauritania  2007 3.312665e+06     9.266959
                    Togo  2004 5.534598e+06     9.265585
                   Gabon  2003 1.328146e+06     9.259994
                   Ghana  2002 1.992452e+07     9.258923
           Guinea-Bissau  2011 1.596154e+06     9.258850
                 Nigeria  2003 1.319725e+08     9.256929
                   Ghana  2008 2.329864e+06     9.254142
                Honduras  2002 6.863157e+06     9.253297
                    Iraq  2008 2.911142e+07     9.252546
                   Ghana  2011 2.512180e+07     9.248716
                 Vanuatu  2002 1.939560e+05     9.246500
                Honduras  2012 8.556460e+05     9.245294
                   Tonga  2006 1.168900e+04     9.244522
                 Comoros  2008 6.572290e+05     9.243275
             Philippines  2003 8.331954e+06     9.241780
        Papua New Guinea  2007 6.627922e+06     9.239779
    Syrian Arab Republic  2005 1.829461e+07     9.239685
             Timor-Leste  2001 8.925310e+05     9.239322
             Timor-Leste  2013 1.184366e+06     9.238649
              Luxembourg  2014 5.563190e+05     9.238497
                   Sudan  2013 3.684992e+07     9.238386
                   Ghana  2013 2.634625e+07     9.238118
   Sao Tome and Principe  2006 1.593280e+05     9.237615
            Sierra Leone  2011 6.611692e+06     9.236846
                  Belize  2011 3.291920e+05     9.233524
   Sao Tome and Principe  2008 1.669130e+05     9.233156
                 Liberia  2015 4.499621e+06     9.232528
            Sierra Leone  2010 6.458720e+05     9.231474
                  Belize  2005 2.832770e+05     9.230669
               Guatemala  2001 1.192495e+07     9.229481
                    Togo  2011 6.679282e+06     9.229361
              Tajikistan  2011 7.815949e+06     9.228118
            Burkina Faso  2001 1.194459e+07     9.227038
              Tajikistan  2014 8.362745e+06     9.226042
         Solomon Islands  2011 5.396140e+05     9.224028
                  Belize  2013 3.441810e+05     9.221882
                  Greece  2007 1.148473e+06     9.221187
                Kiribati  2007 9.631100e+04     9.217590
                   Sudan  2010 3.438596e+07     9.216832
               Guatemala  2011 1.494892e+07     9.215078
                  Bhutan  2005 6.566390e+05     9.214975
                Pakistan  2002 1.446541e+08     9.214651
   Sao Tome and Principe  2001 1.416220e+05     9.213616
               Guatemala  2013 1.559621e+07     9.212587
    Syrian Arab Republic  2001 1.676690e+07     9.212211
                Malaysia  2002 2.419881e+07     9.210913
        Papua New Guinea  2013 7.592865e+06     9.207714
         Solomon Islands  2015 5.874820e+05     9.207492
   Sao Tome and Principe  2014 1.912660e+05     9.203574
                    Chad  2003 9.353210e+05     9.201016
                  Rwanda  2009 9.977446e+06     9.200125
             Afghanistan  2004 2.411898e+07     9.198942
                  Guinea  2005 9.679745e+06     9.197481
                  Rwanda  2007 9.447420e+05     9.196011
                 Namibia  2011 2.215621e+06     9.195341
           Guinea-Bissau  2009 1.517448e+06     9.195094
                Maldives  2010 3.670000e+02     9.194444
                Malaysia  2005 2.565939e+07     9.192738
                 Namibia  2015 2.425561e+06     9.191775
                Mongolia  2014 2.923896e+06     9.190738
              Seychelles  2008 8.695600e+04     9.190554
                  Uganda  2001 2.485489e+07     9.189463
                Botswana  2014 2.168573e+06     9.187934
              Luxembourg  2009 4.977830e+05     9.186903
                 Algeria  2010 3.611764e+07     9.183805
              Seychelles  2013 8.994900e+04     9.183290
              Cabo Verde  2001 4.437160e+05     9.181877
                  Panama  2008 3.516268e+06     9.180661
            Turkmenistan  2012 5.267839e+06     9.180166
                Mongolia  2011 2.761516e+06     9.180141
              Tajikistan  2001 6.327125e+06     9.178363
           Guinea-Bissau  2006 1.412669e+06     9.174945
                 Lebanon  2009 4.183156e+06     9.174356
                   Nepal  2001 2.416178e+07     9.173761
                 Eritrea  2011 4.474690e+05     9.173449
                Zimbabwe  2008 1.355847e+07     9.171402
                Djibouti  2010 8.511460e+05     9.170953
                Cambodia  2002 1.263473e+07     9.169017
                 Namibia  2010 2.173170e+05     9.167353
             Philippines  2012 9.686664e+07     9.166744
                Malaysia  2010 2.811229e+07     9.165785
                Kiribati  2001 8.585800e+04     9.165522
                  Malawi  2003 1.233669e+07     9.164435
             Philippines  2004 8.467849e+07     9.163101
                 Ecuador  2011 1.517736e+07     9.162484
             Philippines  2010 9.372662e+07     9.162404
              Uzbekistan  2009 2.776740e+05     9.160787
                  Rwanda  2004 8.818438e+06     9.155443
                Botswana  2001 1.754935e+06     9.153876
                  Mexico  2010 1.173189e+08     9.152888
                  Turkey  2001 6.419147e+07     9.150260
                   Haiti  2010 9.999617e+06     9.148938
                  Cyprus  2002 9.769660e+05     9.146923
                  Turkey  2012 7.456987e+07     9.146312
                Botswana  2005 1.855852e+06     9.144982
              Costa Rica  2005 4.247841e+06     9.144389
            South Africa  2013 5.331196e+07     9.142071
                Mongolia  2008 2.628131e+06     9.140685
             Afghanistan  2010 2.883167e+06     9.140178
                 Algeria  2005 3.328844e+07     9.139034
               Australia  2015 2.378934e+07     9.137384
               Indonesia  2007 2.329891e+08     9.137093
                 Iceland  2005 2.967340e+05     9.136435
      Dominican Republic  2009 9.767758e+06     9.136188
                 Ecuador  2008 1.444756e+07     9.135418
              Luxembourg  2005 4.651580e+05     9.135265
                  Mexico  2015 1.258995e+07     9.135072
            South Africa  2007 4.888384e+07     9.134760
                Paraguay  2014 6.552584e+06     9.134314
              Cabo Verde  2004 4.676640e+05     9.134223
                 Ecuador  2015 1.614437e+07     9.133856
            Turkmenistan  2010 5.872100e+04     9.133046
              Azerbaijan  2012 9.295784e+06     9.132948
               Indonesia  2010 2.425241e+08     9.132835
              Costa Rica  2009 4.488263e+06     9.132480
                 Iceland  2001 2.849680e+05     9.132196
            Turkmenistan  2008 4.935762e+06     9.132185
                  Israel  2009 7.485600e+04     9.132106
                 Algeria  2001 3.159215e+07     9.130996
               Nicaragua  2009 5.666581e+06     9.128734
               Indonesia  2012 2.488832e+08     9.126462
            Turkmenistan  2007 4.871370e+05     9.124431
               Australia  2002 1.965140e+05     9.122804
                  Turkey  2006 6.876345e+06     9.122007
                    Peru  2003 2.693774e+07     9.121387
                  Turkey  2007 6.959728e+07     9.121261
        Papua New Guinea  2011 7.269348e+06     9.121071
                   Nepal  2014 2.832324e+07     9.120753
                   India  2014 1.293859e+09     9.119642
                Honduras  2008 7.872658e+06     9.119462
              Azerbaijan  2015 9.649341e+06     9.119079
            South Africa  2006 4.823384e+06     9.118976
                Honduras  2013 8.657785e+06     9.118419
              Bangladesh  2013 1.575713e+08     9.118397
               Swaziland  2007 1.138434e+06     9.118154
      Dominican Republic  2005 9.237566e+06     9.117838
               Nicaragua  2013 5.945747e+06     9.116666
                 Tunisia  2015 1.127366e+07     9.116369
              Bangladesh  2011 1.539119e+08     9.115860
                  Canada  2013 3.515545e+07     9.115090
             Switzerland  2011 7.912398e+06     9.111704
                 Morocco  2011 3.285882e+07     9.111530
                Colombia  2009 4.541618e+07     9.111485
               Indonesia  2014 2.551311e+08     9.111305
                  Cyprus  2011 1.124835e+06     9.109332
                Suriname  2005 4.989460e+05     9.107692
                   Chile  2006 1.631979e+07     9.106611
                  Sweden  2015 9.799186e+06     9.106307
                Suriname  2011 5.315890e+05     9.103758
                   Tonga  2009 1.364000e+03     9.103704
                   Nepal  2007 2.621485e+07     9.103548
                   Haiti  2009 9.852870e+05     9.102503
                    Fiji  2009 8.519670e+05     9.102296
              Tajikistan  2009 7.472819e+06     9.102117
                 Ecuador  2005 1.373523e+07     9.102058
                Suriname  2003 4.883320e+05     9.101191
                  Brazil  2009 1.948960e+08     9.099322
                Suriname  2014 5.479280e+05     9.099311
             Afghanistan  2001 2.966463e+06     9.098391
                  Guinea  2001 8.971139e+06     9.096426
                   Gabon  2006 1.444844e+06     9.094909
                Thailand  2001 6.354332e+07     9.092937
              Kazakhstan  2004 1.512985e+06     9.092084
                Colombia  2012 4.688148e+07     9.089315
                   Spain  2009 4.636295e+07     9.088955
                  Rwanda  2001 8.329460e+05     9.087389
               Argentina  2004 3.872870e+07     9.087229
              Seychelles  2007 8.533000e+03     9.086288
                  Canada  2006 3.257550e+05     9.081549
                 Tunisia  2002 9.864326e+06     9.080338
                  France  2010 6.527512e+06     9.077302
                  Sweden  2008 9.219637e+06     9.077295
                 Lesotho  2004 1.933728e+06     9.076906
                   Samoa  2015 1.937590e+05     9.076395
                   Samoa  2011 1.876650e+05     9.075973
               Sri Lanka  2007 1.966800e+04     9.075820
              Azerbaijan  2002 8.171950e+05     9.074896
                  Panama  2015 3.969249e+06     9.074594
                Djibouti  2005 7.832540e+05     9.073747
                  Cyprus  2013 1.143896e+06     9.072876
                    Fiji  2015 8.921490e+05     9.070993
                  France  2005 6.317936e+07     9.068589
                 Jamaica  2002 2.695446e+06     9.068492
                   Malta  2002 3.959690e+05     9.068374
                  Norway  2005 4.623291e+06     9.068340
                   Italy  2008 5.882673e+07     9.066467
                    Fiji  2012 8.735960e+05     9.066094
                   Haiti  2007 9.556889e+06     9.065636
                   Italy  2004 5.768533e+07     9.064923
                   China  2004 1.296750e+05     9.064809
                  Guyana  2012 7.539100e+04     9.064210
             Switzerland  2001 7.229854e+06     9.063478
                  France  2003 6.224488e+07     9.063411
                 Nigeria  2009 1.544218e+07     9.061764
                Thailand  2006 6.582416e+07     9.060939
                  Guyana  2014 7.633930e+05     9.060397
                  Sweden  2014 9.696110e+05     9.060397
              Mauritania  2011 3.717672e+06     9.060188
                  Israel  2005 6.931000e+03     9.059507
                   India  2011 1.247236e+08     9.059419
                Slovenia  2015 2.635310e+05     9.059203
                   Samoa  2001 1.755660e+05     9.054751
                  France  2014 6.633196e+07     9.050514
                Thailand  2009 6.688187e+07     9.050508
             Netherlands  2010 1.661539e+07     9.049301
            Burkina Faso  2005 1.342193e+06     9.048686
                   Niger  2012 1.773163e+07     9.048324
                 Finland  2008 5.313399e+06     9.046663
             El Salvador  2012 6.221246e+06     9.046323
                 Jamaica  2007 2.775467e+06     9.045885
             Afghanistan  2006 2.589345e+06     9.044085
            Turkmenistan  2005 4.754641e+06     9.043644
                 Jamaica  2011 2.829493e+06     9.043600
                 Denmark  2010 5.547683e+06     9.042964
     Trinidad and Tobago  2002 1.277837e+06     9.042888
                   Japan  2008 1.286300e+04     9.041374
                   Malta  2011 4.162680e+05     9.040716
                Thailand  2014 6.841677e+07     9.040080
                   Chile  2013 1.746298e+07     9.037662
                 Ireland  2001 3.866243e+06     9.037653
             El Salvador  2014 6.281189e+06     9.037424
                 Finland  2006 5.266268e+06     9.036798
             Netherlands  2012 1.675496e+07     9.036674
               Mauritius  2008 1.244121e+06     9.036229
                   Nepal  2011 2.732715e+07     9.035172
             Netherlands  2004 1.628178e+07     9.034797
                   China  2003 1.288400e+04     9.034268
                 Austria  2011 8.391643e+06     9.033722
                 Ireland  2011 4.576794e+06     9.033418
                  Sweden  2002 8.924958e+06     9.032597
                 Jamaica  2015 2.871934e+06     9.031661
                  Norway  2001 4.513751e+06     9.031293
                Thailand  2005 6.542547e+06     9.031027
                 Finland  2004 5.228172e+06     9.028835
                 Denmark  2012 5.591572e+06     9.028430
               Mauritius  2012 1.255882e+06     9.027482
                 Uruguay  2007 3.339741e+06     9.024947
                  Malawi  2006 1.342926e+07     9.023999
             Netherlands  2007 1.638170e+07     9.021770
                   Samoa  2007 1.822860e+05     9.019017
            Sierra Leone  2008 6.165372e+06     9.018202
      Russian Federation  2014 1.438197e+08     9.017452
                 Germany  2001 8.234992e+07     9.016828
                  Angola  2013 2.599834e+06     9.014190
Central African Republic  2003 3.981665e+06     9.013946
                 Uruguay  2009 3.362755e+06     9.013445
     Trinidad and Tobago  2005 1.296934e+06     9.012228
Central African Republic  2013 4.499653e+06     9.012223
  Bosnia and Herzegovina  2001 3.771284e+06     9.012010
                  France  2011 6.534278e+07     9.010365
              Montenegro  2007 6.158750e+05     9.010158
                Slovenia  2002 1.994530e+05     9.009686
  Bosnia and Herzegovina  2003 3.779247e+06     9.008944
                Slovenia  2003 1.995733e+06     9.006031
              Montenegro  2015 6.221590e+05     9.005613
  Bosnia and Herzegovina  2013 3.649990e+05     9.004907
      Russian Federation  2009 1.427853e+08     9.003012
                 Uruguay  2005 3.325612e+06     9.001961
                 Uruguay  2002 3.327773e+06     9.001933
               Nicaragua  2012 5.877180e+05     8.998265
                 Croatia  2004 4.439000e+03     8.997748
  Bosnia and Herzegovina  2006 3.779468e+06     8.994547
                 Germany  2005 8.246942e+07     8.994324
                Paraguay  2011 6.293783e+06     8.992083
      Russian Federation  2008 1.427424e+07     8.991849
                   Spain  2015 4.644770e+07     8.991154
                 Croatia  2007 4.436000e+03     8.990991
                  Panama  2012 3.772938e+06     8.987077
                  Guyana  2001 7.522630e+05     8.986101
                 Germany  2010 8.177693e+06     8.982085
                 Belgium  2009 1.796493e+06     8.982014
                   Tonga  2001 9.861100e+04     8.978850
                  Guyana  2007 7.478690e+05     8.976775
                 Belarus  2011 9.473172e+06     8.976139
               Swaziland  2011 1.225258e+06     8.974178
                 Georgia  2015 3.717100e+04     8.973437
                 Croatia  2010 4.417781e+06     8.972913
                Thailand  2003 6.455495e+07     8.972705
                 Hungary  2014 9.866468e+06     8.972354
                   Spain  2014 4.648882e+06     8.970899
                  Serbia  2004 7.463157e+06     8.969606
              Cabo Verde  2007 4.864380e+05     8.969013
                  Poland  2015 3.798641e+07     8.965649
              Kazakhstan  2014 1.728922e+07     8.963391
                 Ukraine  2012 4.559330e+05     8.963353
              Costa Rica  2007 4.369469e+06     8.957905
                 Belarus  2008 9.527985e+06     8.956586
                 Croatia  2012 4.267558e+06     8.956460
              Mozambique  2001 1.858876e+07     8.952823
                 Denmark  2005 5.419432e+06     8.952623
                  Serbia  2006 7.411569e+06     8.951500
                  Norway  2008 4.768212e+06     8.951335
                  Serbia  2013 7.164132e+06     8.950501
                  Serbia  2002 7.496522e+06     8.949819
                 Belarus  2001 9.928549e+06     8.948835
                  Canada  2011 3.434278e+06     8.946529
                 Estonia  2005 1.354775e+06     8.942938
                 Vanuatu  2010 2.362950e+05     8.934623
                  Panama  2004 3.269541e+06     8.932562
                 Belarus  2005 9.663915e+06     8.930591
                 Myanmar  2001 4.662799e+07     8.930438
                 Romania  2003 2.157433e+07     8.926094
                Bulgaria  2005 7.658972e+06     8.924985
                Cameroon  2013 2.165572e+07     8.922967
              Montenegro  2013 6.212700e+04     8.922856
                 Ukraine  2006 4.678775e+06     8.922855
                  Poland  2014 3.811735e+06     8.921329
            Burkina Faso  2014 1.758598e+07     8.920319
               Indonesia  2004 2.236146e+08     8.918498
               Lithuania  2003 3.415213e+06     8.917364
                 Armenia  2007 2.933560e+05     8.915701
                 Romania  2015 1.981548e+07     8.912801
                 Finland  2003 5.213140e+05     8.911289
  Bosnia and Herzegovina  2011 3.688865e+06     8.908739
                    Mali  2004 1.239196e+06     8.903427
                Portugal  2011 1.557560e+05     8.901214
               Argentina  2013 4.253992e+07     8.900514
                  Latvia  2003 2.287955e+06     8.897155
                 Iceland  2013 3.237640e+05     8.896198
                 Estonia  2004 1.362550e+05     8.893625
                 Jamaica  2010 2.817210e+05     8.891194
                  Zambia  2006 1.238345e+07     8.889699
                   Italy  2010 5.927742e+07     8.887207
               Lithuania  2007 3.231294e+06     8.881663
                 Burundi  2012 9.319710e+05     8.876969
                Bulgaria  2013 7.265115e+06     8.872583
                 Georgia  2006 4.136000e+03     8.871122
                    Mali  2011 1.554989e+06     8.867621
                   Malta  2001 3.932800e+04     8.864058
                 Georgia  2004 4.245000e+03     8.849188
                 Armenia  2015 2.916950e+05     8.847242
              Bangladesh  2010 1.521491e+07     8.844794
               Lithuania  2005 3.322528e+06     8.836512
   Sao Tome and Principe  2010 1.747760e+05     8.811711
                Cameroon  2003 1.651382e+07     8.801151
               Guatemala  2009 1.431628e+06     8.781151
                   Kenya  2010 4.135152e+06     8.759070
                 Hungary  2003 1.129552e+06     8.748610
              Seychelles  2011 8.744100e+04     8.740559
                  Rwanda  2014 1.134536e+07     8.737242
                 Hungary  2001 1.187576e+06     8.736544
                  Zambia  2009 1.345642e+07     8.733274
                    Togo  2014 7.228915e+06     8.730042
                Portugal  2012 1.514844e+06     8.725751
               Sri Lanka  2004 1.922800e+04     8.696420
                 Belgium  2012 1.112825e+07     8.695756
                Zimbabwe  2011 1.438665e+07     8.679395
                Ethiopia  2003 7.254514e+07     8.676309
                Kiribati  2005 9.232500e+04     8.675645
                Cambodia  2010 1.438740e+05     8.637862
                 Albania  2006 2.992547e+06     8.607293
                Portugal  2015 1.358760e+05     8.594408
                 Tunisia  2005 1.124820e+05     8.563983
                  Israel  2014 8.215700e+04     8.558697
              Azerbaijan  2001 8.111200e+04     8.558331
                Paraguay  2009 6.127837e+06     8.469442
             Timor-Leste  2011 1.131523e+06     8.461607
                 Comoros  2012 7.238680e+05     8.453800
                   Benin  2002 7.295394e+06     8.392409
                  Bhutan  2003 6.234340e+05     8.389208
                  Latvia  2014 1.993782e+06     8.376018
                 Armenia  2002 3.338970e+05     8.364661
                 Iceland  2007 3.115660e+05     8.222841
                 Austria  2003 8.121423e+06     8.208412
             Switzerland  2014 8.188649e+06     8.207495
                Maldives  2004 3.120000e+02     8.176471
               Lithuania  2012 2.987773e+06     8.105871
              Luxembourg  2011 5.183470e+05     8.101312
Central African Republic  2005 4.127910e+05     8.065157
      Dominican Republic  2012 1.154950e+05     8.026573
                  Latvia  2012 2.343190e+05     8.019554
                 Eritrea  2007 4.153332e+06     7.900353
                  France  2001 6.135743e+07     7.876304
                 Hungary  2008 1.381880e+05     7.870715
                  Norway  2014 5.137232e+06     7.863058
            Turkmenistan  2011 5.174610e+05     7.812197
              Montenegro  2003 6.122670e+05     7.768216
                  Brazil  2013 2.248632e+06     7.750120
                  Belize  2009 3.139290e+05     7.680465
             South Sudan  2011 1.448857e+06     7.665827
            South Africa  2011 5.172935e+07     7.651213
                Cameroon  2012 2.182383e+06     7.644916
                 Myanmar  2013 5.144820e+07     7.594016
                 Liberia  2003 3.116233e+06     7.587905
             Afghanistan  2013 3.173169e+07     7.583189
                    Iraq  2011 3.172753e+06     7.432095
                   Malta  2009 4.124770e+05     7.353288
              Uzbekistan  2015 3.129890e+05     7.329271
                Slovenia  2007 2.181220e+05     7.118282
                  Panama  2002 3.149265e+06     7.081587
                 Morocco  2007 3.122588e+07     7.070067
                Kiribati  2010 1.265200e+04     7.068878
                 Namibia  2005 2.321960e+05     6.944300
                    Peru  2015 3.137667e+07     6.896772
                   Tonga  2005 1.141000e+03     6.815068
               Australia  2008 2.124920e+05     6.514924
             Afghanistan  2002 2.197992e+07     6.409471
                 Lesotho  2013 2.117361e+06     6.303058
                  Guinea  2008 1.323142e+06     5.725777
             Timor-Leste  2010 1.195910e+05     5.221893
       Equatorial Guinea  2014 1.129424e+06     5.146659
               Swaziland  2005 1.158730e+05     4.926098
                 Senegal  2005 1.125127e+07     4.752346
                  Cyprus  2010 1.112670e+05     4.598058
            South Africa  2004 4.717990e+03    -0.999898
                    Iraq  2014 3.568000e+03    -0.999895
                 Ukraine  2002 4.822500e+04    -0.999009
  Bosnia and Herzegovina  2007 3.774000e+03    -0.999001
               Mauritius  2010 1.254000e+03    -0.998995
                 Myanmar  2009 4.986900e+04    -0.998992
              Kazakhstan  2008 1.567400e+04    -0.998988
                 Algeria  2007 3.437600e+04    -0.998982
                  Mexico  2011 1.199170e+05    -0.998978
                 Uruguay  2013 3.485000e+03    -0.998974
                Cameroon  2008 1.897800e+04    -0.998968
                Ethiopia  2007 8.149000e+03    -0.998967
                 Senegal  2011 1.339100e+04    -0.998963
                   Niger  2002 1.226200e+04    -0.998958
                 Ecuador  2002 1.372600e+04    -0.998932
                  Turkey  2008 7.443200e+04    -0.998931
                   India  2015 1.395398e+06    -0.998922
              Tajikistan  2006 7.557000e+03    -0.998897
             El Salvador  2004 6.775000e+03    -0.998865
                  Cyprus  2004 1.141000e+03    -0.998852
                  Bhutan  2008 7.950000e+02    -0.998843
            Turkmenistan  2009 5.795000e+03    -0.998826
                 Tunisia  2004 1.176100e+04    -0.998817
      Dominican Republic  2011 1.279500e+04    -0.998707
                  Brazil  2012 2.569830e+05    -0.998707
                   Tonga  2004 1.460000e+02    -0.998537
    Syrian Arab Republic  2012 2.427100e+04    -0.991525
    Syrian Arab Republic  2014 1.923900e+04    -0.990328
                Portugal  2014 1.416200e+04    -0.990282
               Nicaragua  2001 5.175000e+03    -0.990176
  Bosnia and Herzegovina  2012 3.648200e+04    -0.990110
                 Armenia  2006 2.958500e+04    -0.990076
                 Albania  2007 2.971700e+04    -0.990070
                Bulgaria  2006 7.612200e+04    -0.990061
                 Albania  2012 2.941000e+03    -0.990037
                 Germany  2008 8.211970e+05    -0.990018
                 Croatia  2015 4.236400e+04    -0.990005
                  Poland  2007 3.812560e+05    -0.990004
      Russian Federation  2004 1.446754e+06    -0.989998
                Slovenia  2001 1.992600e+04    -0.989982
                   Japan  2007 1.281000e+03    -0.989981
                 Hungary  2010 1.230000e+02    -0.989971
                Portugal  2010 1.573100e+04    -0.989969
                  Guyana  2011 7.491000e+03    -0.989966
                Thailand  2015 6.865760e+05    -0.989965
     Trinidad and Tobago  2010 1.328100e+04    -0.989951
               Lithuania  2015 2.949100e+04    -0.989943
                Thailand  2010 6.728880e+05    -0.989939
                  Latvia  2007 2.232500e+04    -0.989936
Central African Republic  2015 4.546100e+04    -0.989932
             Switzerland  2003 7.339100e+04    -0.989925
      Russian Federation  2015 1.449687e+06    -0.989920
              Kazakhstan  2003 1.499180e+05    -0.989911
                   China  2002 1.284000e+03    -0.989904
                 Belgium  2014 1.129570e+05    -0.989899
                Thailand  2004 6.522310e+05    -0.989896
            Turkmenistan  2001 4.564800e+04    -0.989892
     Trinidad and Tobago  2015 1.369200e+04    -0.989891
                 Myanmar  2002 4.714220e+05    -0.989890
              Uzbekistan  2005 2.616700e+04    -0.989883
            Turkmenistan  2006 4.811500e+04    -0.989880
                  Sweden  2013 9.637900e+04    -0.989875
                    Peru  2005 2.761410e+05    -0.989875
                Zimbabwe  2005 1.294320e+05    -0.989870
              Montenegro  2011 6.279000e+03    -0.989863
                  Mexico  2014 1.242216e+06    -0.989862
                  Brazil  2001 1.777567e+06    -0.989859
              Uzbekistan  2007 2.686800e+04    -0.989857
              Kazakhstan  2011 1.655660e+05    -0.989856
              Cabo Verde  2013 5.216000e+03    -0.989852
                Zimbabwe  2002 1.255250e+05    -0.989849
                Cambodia  2008 1.388590e+05    -0.989847
                    Peru  2009 2.915700e+04    -0.989820
                Honduras  2011 8.351600e+04    -0.989809
              Bangladesh  2001 1.341716e+06    -0.989803
                   Haiti  2008 9.752900e+04    -0.989795
                  Turkey  2014 7.736280e+05    -0.989792
              Seychelles  2006 8.460000e+02    -0.989790
             Philippines  2002 8.135260e+05    -0.989788
                Pakistan  2003 1.477341e+06    -0.989787
Central African Republic  2009 4.442300e+04    -0.989777
   Sao Tome and Principe  2011 1.788000e+03    -0.989770
                 Myanmar  2004 4.873770e+05    -0.989766
              Costa Rica  2013 4.764100e+04    -0.989764
            Sierra Leone  2009 6.312600e+04    -0.989761
                  Bhutan  2011 7.451000e+03    -0.989760
                 Comoros  2013 7.415000e+03    -0.989756
               Nicaragua  2011 5.878200e+04    -0.989755
                  Canada  2010 3.452740e+05    -0.989733
                Ethiopia  2010 8.772670e+05    -0.989730
                   Ghana  2007 2.272120e+05    -0.989725
                   Kenya  2002 3.321490e+05    -0.989724
                  Guinea  2014 1.185590e+05    -0.989723
                  Bhutan  2001 5.896000e+03    -0.989718
                   Kenya  2006 3.752500e+04    -0.989714
                 Eritrea  2005 3.969700e+04    -0.989712
            Burkina Faso  2002 1.229310e+05    -0.989708
                   Benin  2011 9.468200e+04    -0.989708
                  Zambia  2010 1.385330e+05    -0.989705
                  Israel  2004 6.890000e+02    -0.989701
                   Benin  2004 7.754000e+03    -0.989696
                  Rwanda  2006 9.265800e+04    -0.989695
                   Kenya  2014 4.624250e+05    -0.989684
                  Greece  2002 1.922200e+04    -0.989677
             Afghanistan  2014 3.275820e+05    -0.989677
                Portugal  2005 1.533300e+04    -0.989667
             Netherlands  2001 1.646180e+05    -0.989663
                   Gabon  2010 1.642100e+04    -0.989651
              Mozambique  2015 2.816910e+05    -0.989648
             South Sudan  2007 8.856800e+04    -0.989541
                 Belgium  2007 1.625700e+04    -0.989498
        Papua New Guinea  2003 6.172400e+04    -0.989471
            Burkina Faso  2004 1.335690e+05    -0.989445
                  Jordan  2014 8.893600e+04    -0.989429
                  Sweden  2006 9.855000e+03    -0.989398
                Honduras  2009 8.352100e+04    -0.989391
    Syrian Arab Republic  2002 1.787910e+05    -0.989337
                  Uganda  2015 4.144870e+05    -0.989327
                Pakistan  2004 1.578300e+04    -0.989317
             Afghanistan  2005 2.577980e+05    -0.989311
                    Mali  2012 1.666700e+04    -0.989282
                  Angola  2012 2.596150e+05    -0.989280
                    Mali  2003 1.251280e+05    -0.989249
                   Kenya  2009 4.237240e+05    -0.989176
                  Malawi  2012 1.697350e+05    -0.989139
                Suriname  2006 5.437000e+03    -0.989103
               Guatemala  2005 1.396280e+05    -0.989089
                   Tonga  2008 1.350000e+02    -0.989075
               Swaziland  2002 1.893000e+03    -0.989053
                    Chad  2015 1.494130e+05    -0.988989
                 Morocco  2005 3.521700e+04    -0.988923
             Philippines  2014 1.122490e+05    -0.988602
Central African Republic  2004 4.553600e+04    -0.988564
                   Haiti  2011 1.145540e+05    -0.988544
                 Lebanon  2006 4.573500e+04    -0.988529
              Costa Rica  2002 4.632400e+04    -0.988410
                  Mexico  2001 1.367680e+05    -0.988330
              Mozambique  2004 2.312750e+05    -0.988270
                 Ireland  2004 4.726200e+04    -0.988174
                 Liberia  2011 4.716700e+04    -0.988053
                 Senegal  2003 1.679900e+04    -0.987974
                 Romania  2012 2.583500e+04    -0.987970
                Slovenia  2005 2.474000e+03    -0.987612
                    Iraq  2010 3.762710e+05    -0.987413
                 Namibia  2004 2.922800e+04    -0.985287
                   Benin  2013 1.445100e+04    -0.985147
                    Chad  2005 1.679000e+03    -0.982716
                  Guinea  2007 1.967270e+05    -0.980091
                  Greece  2006 1.123620e+05    -0.943460
                  Zambia  2002 1.112490e+05    -0.939012
                  Mexico  2006 1.192378e+06    -0.935450
                   Ghana  2005 2.154290e+05    -0.927867
                   Sudan  2006 3.167640e+05    -0.919026
                  Latvia  2011 2.597900e+04    -0.912692
                 Hungary  2009 1.226500e+04    -0.911244
                 Georgia  2008 4.300000e+01    -0.910788
             El Salvador  2008 6.113100e+04    -0.910559
              Mauritania  2005 3.137200e+04    -0.908489
                  Greece  2009 1.117170e+05    -0.905151
                 Armenia  2003 3.178600e+04    -0.904803
                 Armenia  2001 3.565500e+04    -0.903528
                  Greece  2015 1.828830e+05    -0.903360
                 Croatia  2011 4.286220e+05    -0.902978
                 Albania  2002 3.511000e+03    -0.902939
                Cambodia  2005 1.327210e+05    -0.902653
                 Hungary  2002 1.158680e+05    -0.902433
  Bosnia and Herzegovina  2014 3.566200e+04    -0.902296
               Lithuania  2006 3.269990e+05    -0.901581
                 Romania  2001 2.213197e+06    -0.901386
                 Georgia  2005 4.190000e+02    -0.901296
                 Namibia  2009 2.137400e+04    -0.901218
               Lithuania  2004 3.377750e+05    -0.901097
                 Georgia  2003 4.310000e+02    -0.901079
                  Serbia  2011 7.234990e+05    -0.900774
                 Ukraine  2004 4.745160e+05    -0.900756
                Bulgaria  2004 7.716860e+05    -0.900752
                 Georgia  2002 4.357000e+03    -0.900670
                 Belarus  2004 9.731460e+05    -0.900666
  Bosnia and Herzegovina  2010 3.722840e+05    -0.900633
                 Estonia  2002 1.379350e+05    -0.900631
               Lithuania  2001 3.478180e+05    -0.900610
                 Romania  2006 2.119376e+06    -0.900591
                 Estonia  2006 1.346810e+05    -0.900588
                Maldives  2009 3.600000e+01    -0.900552
                  Poland  2013 3.841960e+05    -0.900549
                 Albania  2010 2.913210e+05    -0.900489
                 Hungary  2012 9.923620e+05    -0.900482
      Russian Federation  2007 1.428588e+06    -0.900444
      Russian Federation  2001 1.459768e+07    -0.900423
      Russian Federation  2002 1.453646e+06    -0.900419
                 Ukraine  2007 4.659350e+05    -0.900415
                  Serbia  2008 7.352220e+05    -0.900398
                  Serbia  2014 7.135760e+05    -0.900396
                  Serbia  2009 7.328700e+04    -0.900320
                   Spain  2013 4.662450e+05    -0.900319
             El Salvador  2002 5.943300e+04    -0.900274
                 Ukraine  2015 4.515429e+06    -0.900260
                 Hungary  2015 9.843280e+05    -0.900235
                  Serbia  2005 7.447690e+05    -0.900207
                 Estonia  2003 1.377200e+04    -0.900156
                 Belarus  2006 9.649240e+05    -0.900152
                  Serbia  2003 7.485910e+05    -0.900142
                 Albania  2015 2.887300e+04    -0.900064
                 Croatia  2006 4.440000e+02    -0.900045
                 Iceland  2010 3.184100e+04    -0.900028
                  Poland  2001 3.824876e+06    -0.900026
                 Germany  2004 8.251626e+06    -0.900022
                 Uruguay  2004 3.324960e+05    -0.900020
      Russian Federation  2006 1.434953e+07    -0.900016
                 Estonia  2007 1.346800e+04    -0.900001
  Bosnia and Herzegovina  2005 3.781530e+05    -0.899994
                 Romania  2014 1.998979e+06    -0.899969
                  Poland  2003 3.824570e+05    -0.899968
                   Italy  2001 5.697410e+05    -0.899944
                   Spain  2012 4.677355e+06    -0.899934
                  Poland  2009 3.815163e+06    -0.899932
                Slovenia  2004 1.997120e+05    -0.899931
                 Estonia  2015 1.315470e+05    -0.899930
      Russian Federation  2011 1.429687e+07    -0.899917
              Seychelles  2001 8.122000e+03    -0.899890
  Bosnia and Herzegovina  2002 3.775870e+05    -0.899878
             Netherlands  2014 1.686580e+05    -0.899872
                Bulgaria  2012 7.358880e+05    -0.899856
             Netherlands  2006 1.634611e+06    -0.899839
              Montenegro  2006 6.152500e+04    -0.899839
                 Uruguay  2006 3.331430e+05    -0.899825
                 Uruguay  2001 3.327130e+05    -0.899823
                   Japan  2010 1.287000e+03    -0.899821
                  Serbia  2001 7.534330e+05    -0.899761
                 Austria  2010 8.363440e+05    -0.899759
                 Finland  2001 5.188800e+04    -0.899758
                  Greece  2011 1.114899e+06    -0.899751
                 Croatia  2001 4.440000e+02    -0.899684
                Cambodia  2014 1.527790e+05    -0.899665
     Trinidad and Tobago  2001 1.272380e+05    -0.899653
                 Finland  2005 5.246960e+05    -0.899641
                 Denmark  2003 5.395740e+05    -0.899632
Central African Republic  2012 4.494160e+05    -0.899598
                 Armenia  2013 2.893590e+05    -0.899595
                 Belarus  2009 9.567650e+05    -0.899584
                 Finland  2007 5.288720e+05    -0.899574
                  France  2015 6.662468e+06    -0.899559
                Thailand  2013 6.814365e+06    -0.899558
             El Salvador  2011 6.192560e+05    -0.899547
               Mauritius  2007 1.239630e+05    -0.899543
             Netherlands  2011 1.669374e+06    -0.899528
             Netherlands  2003 1.622532e+06    -0.899527
                  France  2012 6.565979e+06    -0.899515
                   China  2009 1.331260e+05    -0.899501
                 Denmark  2011 5.575720e+05    -0.899495
                   Italy  2007 5.843831e+06    -0.899494
                   Malta  2010 4.145800e+04    -0.899490
     Trinidad and Tobago  2003 1.284520e+05    -0.899477
                Thailand  2008 6.654576e+06    -0.899471
             Netherlands  2013 1.684432e+06    -0.899467
             Netherlands  2009 1.653388e+06    -0.899463
                 Denmark  2009 5.523950e+05    -0.899448
                 Uruguay  2008 3.358240e+05    -0.899446
                 Jamaica  2008 2.791220e+05    -0.899432
                 Ireland  2010 4.561550e+05    -0.899423
             El Salvador  2013 6.257770e+05    -0.899413
                  Norway  2004 4.591910e+05    -0.899407
                 Ukraine  2009 4.653300e+04    -0.899406
                  France  2009 6.477440e+05    -0.899380
               Mauritius  2004 1.221300e+04    -0.899346
                 Jamaica  2006 2.762790e+05    -0.899340
                    Fiji  2014 8.858600e+04    -0.899301
                 Jamaica  2012 2.849920e+05    -0.899278
                   China  2001 1.271850e+05    -0.899271
                  Cyprus  2014 1.152390e+05    -0.899257
               Sri Lanka  2006 1.952000e+03    -0.899241
              Azerbaijan  2003 8.234100e+04    -0.899239
                 Jamaica  2001 2.677110e+05    -0.899238
                   Samoa  2010 1.862500e+04    -0.899230
                Slovenia  2014 2.619800e+04    -0.899220
                  France  2002 6.185267e+06    -0.899193
                  France  2004 6.274897e+06    -0.899190
               Mauritius  2014 1.269340e+05    -0.899151
                  France  2007 6.416229e+06    -0.899150
                  Canada  2015 3.584861e+06    -0.899145
                 Lesotho  2008 1.999930e+05    -0.899110
                 Tunisia  2001 9.785710e+05    -0.899108
                    Fiji  2010 8.599500e+04    -0.899063
                  Norway  2006 4.666770e+05    -0.899060
                  Cyprus  2012 1.135620e+05    -0.899041
                Suriname  2015 5.532800e+04    -0.899023
                 Myanmar  2015 5.243669e+06    -0.899013
                  Norway  2015 5.188670e+05    -0.898999
                   Nepal  2009 2.674113e+06    -0.898998
              Seychelles  2012 8.833000e+03    -0.898983
                    Fiji  2008 8.433400e+04    -0.898978
            Turkmenistan  2002 4.612000e+03    -0.898966
                  Guyana  2005 7.594600e+04    -0.898961
             Switzerland  2010 7.824990e+05    -0.898952
               Argentina  2006 3.955889e+06    -0.898944
                Suriname  2004 4.936300e+04    -0.898915
                   Chile  2005 1.614764e+06    -0.898912
                   Italy  2002 5.759700e+04    -0.898907
                Colombia  2010 4.591897e+06    -0.898893
                 Iceland  2004 2.927400e+04    -0.898888
                   Samoa  2006 1.819400e+04    -0.898882
              Azerbaijan  2007 8.581300e+04    -0.898860
              Bangladesh  2015 1.612886e+06    -0.898849
                Suriname  2012 5.377700e+04    -0.898837
              Bangladesh  2008 1.488581e+07    -0.898832
              Bangladesh  2012 1.557275e+07    -0.898820
                  Brazil  2006 1.891241e+07    -0.898819
              Bangladesh  2014 1.594528e+07    -0.898806
                   Nepal  2013 2.798531e+06    -0.898787
                Colombia  2008 4.491544e+06    -0.898781
              Luxembourg  2003 4.516300e+04    -0.898777
                    Peru  2008 2.864198e+06    -0.898766
                Mongolia  2006 2.558120e+05    -0.898746
              Costa Rica  2010 4.545280e+05    -0.898730
                 Algeria  2002 3.199546e+06    -0.898723
              Azerbaijan  2013 9.416810e+05    -0.898698
                  Norway  2011 4.953880e+05    -0.898678
               Indonesia  2011 2.457751e+07    -0.898660
                    Peru  2002 2.661467e+06    -0.898655
               Indonesia  2009 2.393448e+07    -0.898651
              Bangladesh  2006 1.453684e+06    -0.898649
               Nicaragua  2006 5.452110e+05    -0.898647
                Paraguay  2013 6.465740e+05    -0.898644
              Costa Rica  2008 4.429580e+05    -0.898624
Central African Republic  2007 4.275800e+04    -0.898620
               Indonesia  2013 2.523226e+07    -0.898618
               Indonesia  2005 2.267127e+07    -0.898615
                   Chile  2012 1.739746e+06    -0.898577
               Indonesia  2001 2.145652e+06    -0.898572
                 Morocco  2014 3.431882e+06    -0.898539
                   India  2008 1.197147e+08    -0.898519
                  Turkey  2002 6.514354e+06    -0.898517
              Costa Rica  2004 4.187380e+05    -0.898512
               Nicaragua  2003 5.248790e+05    -0.898510
               Australia  2014 2.346694e+06    -0.898488
      Dominican Republic  2003 8.967760e+05    -0.898466
                Zimbabwe  2007 1.332999e+06    -0.898432
             El Salvador  2001 5.959620e+05    -0.898432
                   Haiti  2005 9.263440e+05    -0.898418
                 Morocco  2010 3.249639e+06    -0.898417
                   Nepal  2003 2.495623e+06    -0.898413
              Azerbaijan  2004 8.365000e+03    -0.898410
              Kazakhstan  2006 1.538840e+05    -0.898408
                  Turkey  2011 7.349455e+06    -0.898386
            South Africa  2012 5.256516e+06    -0.898384
                  Mexico  2008 1.136619e+07    -0.898368
             Philippines  2011 9.527794e+06    -0.898345
                 Ecuador  2010 1.493469e+06    -0.898343
                 Armenia  2008 2.982200e+04    -0.898342
                Mongolia  2010 2.712650e+05    -0.898337
                Djibouti  2006 7.962800e+04    -0.898337
             Philippines  2013 9.848132e+06    -0.898333
                Paraguay  2003 5.679500e+04    -0.898328
            Turkmenistan  2004 4.733980e+05    -0.898320
                Djibouti  2009 8.368400e+04    -0.898310
                 Ecuador  2006 1.396748e+06    -0.898309
                 Ecuador  2014 1.593112e+06    -0.898279
            South Africa  2015 5.511977e+06    -0.898203
              Luxembourg  2008 4.886500e+04    -0.898196
                Malaysia  2008 2.711169e+06    -0.898175
      Dominican Republic  2007 9.543530e+05    -0.898163
              Bangladesh  2003 1.391910e+05    -0.898153
                   Spain  2007 4.522683e+06    -0.898132
                Thailand  2002 6.473164e+06    -0.898130
             Philippines  2006 8.789419e+06    -0.898122
                Malaysia  2012 2.917456e+06    -0.898116
                Zimbabwe  2009 1.381599e+06    -0.898101
              Cabo Verde  2002 4.521600e+04    -0.898097
                 Eritrea  2009 4.313340e+05    -0.898093
                   Italy  2009 5.995365e+06    -0.898084
                  Israel  2012 7.915000e+03    -0.898079
               Swaziland  2014 1.295970e+05    -0.898072
                Mongolia  2013 2.869170e+05    -0.898048
                 Lesotho  2002 1.923120e+05    -0.898029
                 Eritrea  2010 4.398400e+04    -0.898028
                  Panama  2005 3.334650e+05    -0.898009
                  Israel  2015 8.381000e+03    -0.897988
                Malaysia  2003 2.468873e+06    -0.897975
                  Israel  2002 6.570000e+02    -0.897966
                 Jamaica  2009 2.848200e+04    -0.897959
Central African Republic  2001 3.832230e+05    -0.897943
                Pakistan  2015 1.893851e+07    -0.897931
                  Cyprus  2001 9.628200e+04    -0.897929
                Kiribati  2006 9.426000e+03    -0.897904
                   India  2010 1.239869e+07    -0.897892
               Guatemala  2003 1.254780e+05    -0.897890
                   Sudan  2009 3.365619e+06    -0.897874
         Solomon Islands  2014 5.755400e+04    -0.897866
           Guinea-Bissau  2003 1.321220e+05    -0.897859
                Suriname  2009 5.261900e+04    -0.897857
               Guatemala  2012 1.527156e+06    -0.897842
                Kiribati  2008 9.844000e+03    -0.897789
             Timor-Leste  2012 1.156760e+05    -0.897770
                Pakistan  2001 1.416144e+07    -0.897769
                Cambodia  2001 1.242473e+06    -0.897759
              Costa Rica  2011 4.647400e+04    -0.897753
                 Vanuatu  2014 2.588500e+04    -0.897745
              Tajikistan  2010 7.641630e+05    -0.897741
                  Belize  2012 3.367100e+04    -0.897716
              Tajikistan  2012 7.995620e+05    -0.897701
                Zimbabwe  2012 1.471826e+06    -0.897695
                 Namibia  2013 2.316520e+05    -0.897677
        Papua New Guinea  2012 7.438360e+05    -0.897675
            Sierra Leone  2012 6.766130e+05    -0.897664
                   Sudan  2012 3.599192e+06    -0.897655
                Pakistan  2009 1.674958e+06    -0.897647
                 Armenia  2014 2.962200e+04    -0.897629
   Sao Tome and Principe  2007 1.631100e+04    -0.897626
         Solomon Islands  2007 4.929400e+04    -0.897608
              Luxembourg  2015 5.696400e+04    -0.897606
               Swaziland  2009 1.186750e+05    -0.897597
   Sao Tome and Principe  2005 1.556300e+04    -0.897591
                   Sudan  2015 3.864783e+06    -0.897589
                 Comoros  2007 6.416200e+04    -0.897574
                   Ghana  2012 2.573349e+06    -0.897565
                 Vanuatu  2008 2.253400e+04    -0.897551
                  Belize  2010 3.216800e+04    -0.897531
        Papua New Guinea  2005 6.314790e+05    -0.897512
                Djibouti  2004 7.775200e+04    -0.897508
   Sao Tome and Principe  2013 1.874500e+04    -0.897506
                    Iraq  2007 2.839433e+06    -0.897486
                   Sudan  2001 2.794550e+05    -0.897468
           Guinea-Bissau  2010 1.555880e+05    -0.897467
              Costa Rica  2015 4.878520e+05    -0.897458
             Afghanistan  2008 2.729431e+06    -0.897455
                   Ghana  2001 1.942165e+06    -0.897450
                 Nigeria  2002 1.286667e+07    -0.897447
                   Gabon  2002 1.294490e+05    -0.897447
                Ethiopia  2015 9.987333e+06    -0.897426
    Syrian Arab Republic  2004 1.786638e+06    -0.897410
                 Liberia  2014 4.397370e+05    -0.897409
                Honduras  2001 6.693610e+05    -0.897405
                    Togo  2015 7.416820e+05    -0.897401
                 Germany  2013 8.645650e+05    -0.897391
                  Turkey  2003 6.685830e+05    -0.897368
                  Panama  2014 3.939860e+05    -0.897358
                 Nigeria  2006 1.426149e+07    -0.897355
     Trinidad and Tobago  2006 1.331440e+05    -0.897339
                    Togo  2003 5.391410e+05    -0.897335
                 Liberia  2005 3.261230e+05    -0.897330
                    Iraq  2004 2.631669e+06    -0.897311
                Cameroon  2014 2.223994e+06    -0.897302
                 Nigeria  2014 1.764652e+06    -0.897302
                Cameroon  2004 1.695981e+06    -0.897299
                Honduras  2014 8.892160e+05    -0.897293
                 Nigeria  2013 1.718293e+07    -0.897291
                  Greece  2012 1.145110e+05    -0.897290
                 Nigeria  2011 1.628778e+07    -0.897289
                   Gabon  2004 1.364250e+05    -0.897282
              Madagascar  2014 2.358981e+06    -0.897262
                  Malawi  2004 1.267638e+06    -0.897246
                Paraguay  2010 6.298770e+05    -0.897211
                Cameroon  2010 1.997495e+06    -0.897209
                    Togo  2009 6.334720e+05    -0.897194
              Cabo Verde  2006 4.879500e+04    -0.897180
                 Senegal  2013 1.412320e+05    -0.897175
               Nicaragua  2004 5.397300e+04    -0.897171
              Seychelles  2010 8.977000e+03    -0.897168
                   Benin  2009 8.944760e+05    -0.897150
                  Belize  2002 2.622600e+04    -0.897146
                  Guinea  2003 9.398480e+05    -0.897142
                Ethiopia  2004 7.462445e+06    -0.897134
           Guinea-Bissau  2015 1.775260e+05    -0.897131
           Guinea-Bissau  2008 1.488410e+05    -0.897064
              Mozambique  2011 2.493950e+05    -0.897035
              Mozambique  2009 2.352463e+06    -0.897033
              Madagascar  2006 1.888268e+06    -0.897023
                    Mali  2015 1.746795e+06    -0.897022
                   Gabon  2015 1.931750e+05    -0.897012
            Burkina Faso  2015 1.811624e+06    -0.896985
                Cambodia  2013 1.522692e+06    -0.896954
             Timor-Leste  2015 1.249770e+05    -0.896953
               Guatemala  2002 1.228848e+06    -0.896951
            Burkina Faso  2007 1.425221e+06    -0.896941
            Burkina Faso  2009 1.514199e+06    -0.896921
                 Senegal  2008 1.223957e+06    -0.896917
                  Zambia  2013 1.515321e+06    -0.896916
              Mauritania  2012 3.832390e+05    -0.896914
                  Malawi  2009 1.471462e+06    -0.896893
                  Bhutan  2004 6.428200e+04    -0.896890
                   China  2005 1.337200e+04    -0.896881
                  Malawi  2007 1.384969e+06    -0.896869
               Argentina  2012 4.296739e+06    -0.896854
                   Benin  2003 7.525550e+05    -0.896845
              Madagascar  2004 1.782997e+06    -0.896812
                 Burundi  2013 9.618600e+04    -0.896793
               Nicaragua  2014 6.139970e+05    -0.896733
                 Iceland  2015 3.381500e+04    -0.896712
              Costa Rica  2006 4.387940e+05    -0.896702
              Kazakhstan  2013 1.735275e+06    -0.896657
                    Mali  2006 1.322764e+06    -0.896649
                  Sweden  2005 9.295720e+05    -0.896640
                 Burundi  2009 8.489310e+05    -0.896626
        Papua New Guinea  2010 7.182390e+05    -0.896618
                  Zambia  2015 1.615870e+05    -0.896616
              Tajikistan  2008 7.397280e+05    -0.896576
                  Uganda  2002 2.571848e+06    -0.896525
                   Malta  2006 4.538000e+03    -0.896473
                   Malta  2004 4.126800e+04    -0.896463
                  Uganda  2005 2.854394e+06    -0.896462
                    Chad  2009 1.152786e+06    -0.896461
                Botswana  2003 1.843390e+05    -0.896436
                  Panama  2011 3.777820e+05    -0.896306
                 Hungary  2004 1.171460e+05    -0.896290
                    Mali  2009 1.466597e+06    -0.896267
                  Angola  2003 1.823369e+06    -0.896238
                    Chad  2012 1.275135e+06    -0.896235
                  Malawi  2002 1.213711e+06    -0.896227
                   Niger  2006 1.413264e+06    -0.896224
    Syrian Arab Republic  2007 1.963286e+06    -0.896205
              Bangladesh  2009 1.545478e+06    -0.896178
                    Chad  2001 8.663120e+05    -0.896158
             South Sudan  2003 7.516420e+05    -0.896143
                    Chad  2004 9.714300e+04    -0.896139
                 Germany  2014 8.982500e+04    -0.896104
              Luxembourg  2012 5.394600e+04    -0.895927
             Afghanistan  2009 2.843310e+05    -0.895828
               Mauritius  2002 1.246210e+05    -0.895827
       Equatorial Guinea  2002 6.664700e+04    -0.895825
                Maldives  2014 4.100000e+01    -0.895674
               Swaziland  2004 1.955300e+04    -0.895657
                   Samoa  2013 1.975700e+04    -0.895573
             South Sudan  2009 9.676670e+05    -0.895536
                  Panama  2003 3.291740e+05    -0.895476
     Trinidad and Tobago  2007 1.392600e+04    -0.895406
                   Italy  2013 6.233948e+06    -0.895298
               Sri Lanka  2003 1.983000e+03    -0.895295
                 Nigeria  2008 1.534739e+06    -0.895180
             Timor-Leste  2003 9.685200e+04    -0.895162
              Mauritania  2008 3.475410e+05    -0.895087
                  Belize  2006 2.974700e+04    -0.894990
             South Sudan  2005 8.188770e+05    -0.894849
            Sierra Leone  2007 6.154170e+05    -0.894777
                  Jordan  2010 7.182390e+05    -0.894704
              Cabo Verde  2010 5.238400e+04    -0.894592
                 Lebanon  2003 3.714640e+05    -0.894555
                Zimbabwe  2013 1.554560e+05    -0.894379
                Botswana  2015 2.291970e+05    -0.894310
                    Iraq  2005 2.784260e+05    -0.894202
                    Chad  2002 9.168900e+04    -0.894162
                  Uganda  2011 3.593648e+06    -0.894040
            South Africa  2009 5.255813e+06    -0.893945
                Cambodia  2004 1.363377e+06    -0.893926
               Sri Lanka  2010 2.119000e+03    -0.893880
                    Peru  2012 3.158966e+06    -0.893852
                   Spain  2001 4.854120e+05    -0.893733
                 Morocco  2004 3.179285e+06    -0.893470
              Azerbaijan  2010 9.543320e+05    -0.893338
                  Malawi  2014 1.768838e+06    -0.893297
   Sao Tome and Principe  2009 1.781300e+04    -0.893280
                  Zambia  2005 1.252156e+06    -0.893268
                Honduras  2003 7.338210e+05    -0.893078
               Australia  2004 2.127400e+04    -0.893071
            Burkina Faso  2013 1.772723e+06    -0.893024
                 Lebanon  2012 4.916440e+05    -0.892850
                  Brazil  2015 2.596218e+06    -0.892776
                Kiribati  2004 9.542000e+03    -0.892660
                   Niger  2011 1.764636e+06    -0.892568
                    Mali  2010 1.575850e+05    -0.892551
            Burkina Faso  2011 1.681940e+05    -0.892543
                Cambodia  2009 1.492800e+04    -0.892495
                Cameroon  2002 1.684886e+06    -0.892490
             Afghanistan  2003 2.364851e+06    -0.892409
                 Burundi  2011 9.435800e+04    -0.892371
                 Myanmar  2011 5.553310e+05    -0.892292
             Timor-Leste  2008 1.781100e+04    -0.892037
                   Sudan  2004 3.186341e+06    -0.891753
                    Togo  2013 7.429480e+05    -0.891690
                Malaysia  2014 3.228170e+05    -0.891553
                Paraguay  2008 6.471170e+05    -0.891535
                Botswana  2010 2.148660e+05    -0.891475
                 Tunisia  2007 1.298870e+05    -0.891411
                  Zambia  2008 1.382517e+06    -0.891363
                   Haiti  2015 1.711610e+05    -0.891151
                   Italy  2014 6.789140e+05    -0.891094
              Uzbekistan  2013 3.243200e+04    -0.891075
                Bulgaria  2001 8.914200e+04    -0.890914
                Ethiopia  2002 7.497192e+06    -0.890540
                  Cyprus  2009 1.987600e+04    -0.890528
               Argentina  2009 4.799470e+05    -0.890483
                 Lesotho  2009 2.192900e+04    -0.890351
                   India  2002 1.898711e+07    -0.889274
                 Comoros  2011 7.656900e+04    -0.888981
               Nicaragua  2015 6.823500e+04    -0.888868
             Switzerland  2013 8.893460e+05    -0.888788
              Cabo Verde  2011 5.867000e+03    -0.888000
                Djibouti  2007 8.942000e+03    -0.887703
                   Benin  2001 7.767330e+05    -0.886872
                 Armenia  2004 3.612000e+03    -0.886365
      Dominican Republic  2014 1.458440e+05    -0.886175
                 Iceland  2006 3.378200e+04    -0.886154
                  Cyprus  2006 1.455900e+04    -0.885953
       Equatorial Guinea  2009 9.911100e+04    -0.885872
              Luxembourg  2010 5.695300e+04    -0.885587
                Maldives  2003 3.400000e+01    -0.885522
                  Guinea  2010 1.794170e+05    -0.884732
                  Angola  2006 2.262399e+06    -0.884291
              Mauritania  2004 3.428230e+05    -0.884069
                 Georgia  2007 4.820000e+02    -0.883462
              Mauritania  2014 4.639200e+04    -0.882438
                 Burundi  2015 1.199270e+05    -0.878761
                 Liberia  2002 3.628630e+05    -0.878687
                 Vanuatu  2004 2.414300e+04    -0.878656
                  Uganda  2007 3.594870e+05    -0.878373
                  Rwanda  2011 1.516710e+05    -0.878356
                 Vanuatu  2005 2.937000e+03    -0.878350
                   Ghana  2003 2.446782e+06    -0.877197
                    Chad  2007 1.775780e+05    -0.875086
                  Rwanda  2010 1.246842e+06    -0.875034
               Lithuania  2010 3.972820e+05    -0.874394
                Cameroon  2011 2.524470e+05    -0.873618
              Mozambique  2005 2.923700e+04    -0.873583
             Timor-Leste  2005 1.264840e+05    -0.873097
              Madagascar  2009 2.569121e+06    -0.871521
                  Latvia  2010 2.975550e+05    -0.861064
                Kiribati  2009 1.568000e+03    -0.840715
                 Hungary  2005 1.876500e+04    -0.839815
             South Sudan  2010 1.671920e+05    -0.827222
                 Tunisia  2013 1.114558e+06    -0.409245
                 Belgium  2011 1.147744e+06    -0.394518
       Equatorial Guinea  2012 1.385930e+05     0.393889
             South Sudan  2013 1.117749e+06    -0.385264
                Kiribati  2014 1.145800e+04    -0.381818
                   India  2003 1.182785e+07    -0.377059
                  Guinea  2011 1.135170e+05    -0.367301
                 Romania  2007 2.882982e+06     0.360298
    Syrian Arab Republic  2011 2.863993e+06     0.351684
                  Rwanda  2013 1.165151e+06    -0.348660
       Equatorial Guinea  2013 1.837460e+05     0.325796
                  Angola  2007 2.997687e+06     0.325004
                  Angola  2008 2.175942e+06    -0.274126
                Botswana  2013 2.128570e+05    -0.264273
             South Sudan  2012 1.818258e+06     0.254960
    Syrian Arab Republic  2010 2.118834e+06    -0.249942
             Afghanistan  2012 3.696958e+06     0.241173
                 Senegal  2002 1.396861e+06     0.231260
                   Sudan  2005 3.911914e+06     0.227714
                 Namibia  2008 2.163750e+05    -0.226997
                   Benin  2015 1.575952e+06     0.224790
                   Ghana  2004 2.986536e+06     0.220598
    Syrian Arab Republic  2009 2.824893e+06     0.214776
                  Belize  2008 3.616500e+04     0.211680
                  Zambia  2001 1.824125e+06     0.191288
    Syrian Arab Republic  2008 2.325443e+06     0.184465
              Madagascar  2010 2.115164e+06    -0.176697
                  Guinea  2009 1.556524e+06     0.176385
             Timor-Leste  2006 1.486210e+05     0.175018
               Lithuania  2011 3.281150e+05    -0.174101
                Botswana  2011 2.513390e+05     0.169748
              Uzbekistan  2014 3.757700e+04     0.158640
                Kiribati  2011 1.465600e+04     0.158394
                  Canada  2001 3.181900e+04    -0.155928
                 Lebanon  2008 4.111470e+05    -0.154829
                Botswana  2012 2.893150e+05     0.151095
                 Senegal  2001 1.134497e+06     0.147751
            South Africa  2010 5.979432e+06     0.137680
                 Tunisia  2008 1.473360e+05     0.134340
                Kiribati  2012 1.661300e+04     0.133529
                    Peru  2013 3.565716e+06     0.128760
               Australia  2006 2.697900e+04     0.126566
               Australia  2005 2.394800e+04     0.125693
            Sierra Leone  2014 7.791620e+05     0.125503
                   Haiti  2012 1.289210e+05     0.125417
                Colombia  2001 4.988990e+05     0.123753
                 Romania  2008 2.537875e+06    -0.119705
                 Lesotho  2010 2.455100e+04     0.119568
                  Norway  2013 5.796230e+05     0.117727
                Kiribati  2013 1.853500e+04     0.115693
                  Serbia  2015 7.953830e+05     0.114644
                    Peru  2014 3.973354e+06     0.114321
             Timor-Leste  2007 1.649730e+05     0.110025
                  Cyprus  2008 1.815630e+05     0.109039
         Solomon Islands  2008 5.447700e+04     0.105145
                 Namibia  2006 2.557340e+05     0.101371
                Suriname  2007 5.975000e+03     0.098952
                   Haiti  2014 1.572466e+06     0.098263
                 Lesotho  2012 2.899280e+05     0.097522
               Argentina  2008 4.382389e+06     0.096358
                 Namibia  2007 2.799150e+05     0.094555
                  Latvia  2013 2.126470e+05    -0.092489
             Philippines  2008 9.751864e+06     0.092114
                 Hungary  2007 1.557800e+04    -0.090973
                  Mexico  2003 1.564453e+06     0.089780
                  Israel  2006 7.537000e+03     0.087433
                  Mexico  2005 1.847223e+07     0.086884
                 Hungary  2006 1.713700e+04    -0.086757
                  Israel  2013 8.595000e+03     0.085913
                Slovenia  2009 2.396690e+05     0.082927
                Ethiopia  2011 9.467560e+05     0.079211
             Timor-Leste  2009 1.922100e+04     0.079165
                 Tunisia  2010 1.639931e+06     0.077602
                Zimbabwe  2010 1.486317e+06     0.075795
                 Tunisia  2011 1.761467e+06     0.074110
                 Lebanon  2013 5.276120e+05     0.073159
               Sri Lanka  2014 2.771000e+03     0.071954
               Sri Lanka  2011 2.271000e+03     0.071732
            Sierra Leone  2015 7.237250e+05    -0.071150
                 Tunisia  2012 1.886668e+06     0.071078
               Swaziland  2001 1.729270e+05     0.070968
                 Albania  2001 3.617300e+04    -0.070748
               Sri Lanka  2015 2.966000e+03     0.070372
               Sri Lanka  2012 2.425000e+03     0.067812
                 Lebanon  2014 5.632790e+05     0.067601
                 Romania  2009 2.367487e+06    -0.067138
               Guatemala  2008 1.463660e+05     0.066139
               Sri Lanka  2013 2.585000e+03     0.065979
                   Malta  2005 4.383400e+04     0.062179
                 Lebanon  2011 4.588368e+06     0.057925
                   Tonga  2007 1.235700e+04     0.057148
                  Malawi  2005 1.339711e+06     0.056856
                   Malta  2008 4.937900e+04     0.056823
                Portugal  2001 1.362722e+06     0.056457
                 Vanuatu  2009 2.378500e+04     0.055516
                 Belgium  2010 1.895586e+06     0.055159
                  Jordan  2012 7.992573e+06     0.055133
             Philippines  2009 9.222879e+06    -0.054245
                  Jordan  2013 8.413464e+06     0.052660
         Solomon Islands  2009 5.167900e+04    -0.051361
                  Jordan  2009 6.821116e+06     0.051048
                 Romania  2010 2.246871e+06    -0.050947
                   Gabon  2005 1.431260e+05     0.049119
            Sierra Leone  2003 5.199549e+06     0.048885
                 Lebanon  2002 3.522837e+06     0.048507
                Pakistan  2010 1.756182e+06     0.048493
               Australia  2007 2.827600e+04     0.048074
                Maldives  2007 3.490000e+02     0.048048
                  Jordan  2008 6.489822e+06     0.047896
                 Albania  2005 3.114870e+05    -0.047263
                  Israel  2007 7.181000e+03    -0.047234
                   Kenya  2004 3.574931e+06     0.047184
       Equatorial Guinea  2008 8.684180e+05     0.047136
                 Austria  2002 8.819570e+05     0.047091
                 Belgium  2006 1.547958e+06     0.046896
       Equatorial Guinea  2007 8.293270e+05     0.046843
                  Norway  2012 5.185730e+05     0.046802
            Sierra Leone  2004 5.439695e+06     0.046186
       Equatorial Guinea  2006 7.922170e+05     0.046084
            Sierra Leone  2002 4.957216e+06     0.046014
       Equatorial Guinea  2011 9.942900e+04     0.045367
       Equatorial Guinea  2005 7.573170e+05     0.044839
                 Romania  2011 2.147528e+06    -0.044214
             El Salvador  2006 6.564780e+05     0.043750
                  Jordan  2007 6.193191e+06     0.043638
       Equatorial Guinea  2004 7.248170e+05     0.043486
                 Liberia  2008 3.662993e+06     0.042717
                Portugal  2002 1.419631e+06     0.041761
       Equatorial Guinea  2001 6.397620e+05     0.041410
             El Salvador  2007 6.834750e+05     0.041124
       Equatorial Guinea  2015 1.175389e+06     0.040698
                 Liberia  2007 3.512932e+06     0.040610
                 Liberia  2009 3.811528e+06     0.040550
               Guatemala  2006 1.339780e+05    -0.040465
                 Belgium  2005 1.478617e+06     0.040446
       Equatorial Guinea  2010 9.511400e+04    -0.040329
            Sierra Leone  2005 5.658379e+06     0.040202
                   Niger  2013 1.842637e+07     0.039181
                   Niger  2014 1.914822e+07     0.039175
                   Niger  2015 1.989696e+07     0.039103
              Azerbaijan  2011 9.173820e+05    -0.038718
                   Niger  2010 1.642558e+07     0.038679
                  Jordan  2006 5.934232e+06     0.038522
                 Lebanon  2001 3.359859e+06     0.038479
                   Niger  2009 1.581391e+07     0.038440
            Sierra Leone  2001 4.739147e+06     0.038308
                   Niger  2008 1.522852e+07     0.038190
                Portugal  2013 1.457295e+06    -0.037990
                 Austria  2001 8.422930e+05     0.037861
             South Sudan  2002 7.237276e+06     0.037685
Central African Republic  2002 3.976120e+05     0.037547
                Maldives  2006 3.330000e+02     0.037383
                 Albania  2004 3.269390e+05    -0.037327
                Maldives  2008 3.620000e+02     0.037249
              Mauritania  2010 3.695430e+05     0.037203
                Slovenia  2010 2.485830e+05     0.037193
                   Niger  2004 1.312712e+06     0.037154
                  Greece  2014 1.892413e+06    -0.037043
                 Liberia  2001 2.991132e+06     0.036959
                   Niger  2001 1.177198e+07     0.036907
                   Tonga  2015 1.636400e+04     0.036877
                 Lebanon  2010 4.337141e+06     0.036811
                  Angola  2005 1.955254e+07     0.036406
                  Angola  2011 2.421856e+07     0.036349
                  Angola  2010 2.336913e+07     0.036346
              Montenegro  2002 6.982800e+04     0.036193
                 Liberia  2010 3.948125e+06     0.035838
                  Angola  2014 2.692466e+06     0.035630
                  Uganda  2006 2.955662e+06     0.035478
                  Rwanda  2008 9.781690e+05     0.035382
                  Uganda  2003 2.662482e+06     0.035241
               Swaziland  2010 1.228430e+05     0.035121
             Timor-Leste  2002 9.238250e+05     0.035062
                  Uganda  2009 3.277190e+07     0.034993
                  Uganda  2010 3.391513e+07     0.034885
                  Angola  2015 2.785935e+06     0.034715
                  Angola  2002 1.757265e+07     0.034704
                 Burundi  2007 7.939573e+06     0.034426
                   Gabon  2013 1.817271e+06     0.034411
                 Burundi  2008 8.212264e+06     0.034346
                 Eritrea  2003 3.738265e+06     0.034201
                Pakistan  2007 1.633297e+07     0.034126
                  Uganda  2014 3.883334e+07     0.034074
                 Burundi  2006 7.675338e+06     0.033954
    Syrian Arab Republic  2006 1.891498e+07     0.033910
                    Mali  2007 1.367566e+06     0.033870
                    Iraq  2013 3.388314e+07     0.033761
            Burkina Faso  2010 1.565217e+06     0.033693
            Sierra Leone  2006 5.848692e+06     0.033634
                 Eritrea  2002 3.614639e+06     0.033603
                 Burundi  2005 7.423289e+06     0.033531
                    Chad  2014 1.356944e+07     0.033186
             Afghanistan  2011 2.978599e+06     0.033100
                 Burundi  2004 7.182451e+06     0.032983
                   Gabon  2009 1.586754e+06     0.032767
                 Belgium  2004 1.421137e+06     0.032703
                 Burundi  2010 8.766930e+05     0.032702
                 Belgium  2003 1.376133e+06     0.032524
             South Sudan  2014 1.153971e+06     0.032406
                  Jordan  2005 5.714111e+06     0.032249
                 Eritrea  2004 3.858623e+06     0.032196
                   Gabon  2014 1.875713e+06     0.032159
                 Lebanon  2005 3.986852e+06     0.031990
              Madagascar  2001 1.626932e+06     0.031868
                   Gabon  2008 1.536411e+06     0.031707
                Honduras  2007 7.779720e+05     0.031593
                  Zambia  2014 1.562974e+06     0.031447
                 Burundi  2003 6.953113e+06     0.031379
                    Chad  2010 1.188722e+06     0.031173
                   Tonga  2011 1.457700e+04     0.031124
                  Belize  2001 2.549840e+05     0.031009
                  Malawi  2010 1.516795e+06     0.030808
                    Togo  2010 6.529520e+05     0.030751
                   Gabon  2007 1.489193e+06     0.030695
              Madagascar  2003 1.727914e+07     0.030660
                    Mali  2002 1.163893e+07     0.030609
                  Zambia  2012 1.469994e+07     0.030507
              Mozambique  2003 1.971660e+07     0.030144
              Mozambique  2007 2.218839e+07     0.029745
              Mauritania  2013 3.946170e+05     0.029689
              Mozambique  2008 2.284676e+07     0.029672
              Mozambique  2002 1.913966e+07     0.029636
                 Senegal  2015 1.497699e+07     0.029622
              Mozambique  2010 2.422145e+06     0.029621
                   Tonga  2014 1.578200e+04     0.029619
                    Mali  2014 1.696285e+07     0.029435
              Mozambique  2014 2.721238e+07     0.029432
                   Benin  2006 8.216896e+06     0.029399
                Ethiopia  2001 6.849226e+07     0.029381
                 Ireland  2007 4.398942e+06     0.029332
              Mauritania  2003 2.957117e+06     0.029197
                    Iraq  2001 2.425165e+07     0.029120
               Swaziland  2006 1.125140e+05    -0.028989
              Madagascar  2008 1.999647e+07     0.028968
                   Benin  2007 8.454791e+06     0.028952
               Australia  2010 2.231750e+05     0.028850
                Maldives  2005 3.210000e+02     0.028846
                  Belize  2004 2.768900e+04     0.028834
                  Israel  2008 7.388000e+03     0.028826
                   Benin  2008 8.696916e+06     0.028638
              Uzbekistan  2010 2.856240e+05     0.028631
                    Iraq  2002 2.493930e+07     0.028355
                 Burundi  2002 6.741569e+06     0.028332
                 Belgium  2001 1.286570e+05     0.028228
                Ethiopia  2005 7.672783e+06     0.028186
                  Malawi  2001 1.169586e+07     0.028102
                    Togo  2001 5.111770e+05     0.027766
                Ethiopia  2006 7.885689e+06     0.027748
              Madagascar  2012 2.234657e+07     0.027715
                Cameroon  2007 1.839539e+07     0.027701
                  Zambia  2007 1.272597e+07     0.027660
                Portugal  2003 1.458821e+06     0.027606
                Cameroon  2005 1.742795e+06     0.027603
                    Iraq  2003 2.562763e+07     0.027600
              Madagascar  2013 2.296115e+07     0.027502
                   Ghana  2009 2.393831e+06     0.027455
                    Togo  2008 6.161796e+06     0.027414
                 Senegal  2007 1.187356e+07     0.027412
                 Namibia  2014 2.379920e+05     0.027369
              Madagascar  2015 2.423488e+06     0.027345
                    Togo  2007 5.997385e+06     0.027338
                 Ireland  2006 4.273591e+06     0.027327
                   Kenya  2012 4.364663e+07     0.027298
                Maldives  2011 3.770000e+02     0.027248
              Uzbekistan  2011 2.933940e+05     0.027204
                    Togo  2006 5.837792e+06     0.027189
                 Senegal  2006 1.155676e+07     0.027152
                  Zambia  2004 1.173175e+07     0.027120
                   Kenya  2013 4.482685e+07     0.027040
                    Togo  2012 6.859482e+06     0.026979
         Solomon Islands  2002 4.352620e+05     0.026917
                    Iraq  2009 2.989465e+07     0.026905
                    Togo  2005 5.683268e+06     0.026862
                Ethiopia  2009 8.541625e+07     0.026824
                  Norway  2007 4.791530e+05     0.026734
                   Samoa  2014 1.922900e+04    -0.026725
                 Nigeria  2007 1.464172e+07     0.026661
           Guinea-Bissau  2013 1.681495e+06     0.026467
         Solomon Islands  2003 4.467690e+05     0.026437
                Ethiopia  2013 9.488772e+07     0.026433
                   Sudan  2003 2.943594e+07     0.026373
               Mauritius  2003 1.213370e+05    -0.026352
           Guinea-Bissau  2014 1.725744e+06     0.026315
           Guinea-Bissau  2012 1.638139e+06     0.026304
                  Belize  2003 2.691300e+04     0.026195
                 Nigeria  2005 1.389395e+08     0.026189
                Ethiopia  2014 9.736677e+07     0.026126
                 Senegal  2009 1.255917e+06     0.026112
                Cameroon  2001 1.567193e+07     0.026037
                 Nigeria  2004 1.353936e+08     0.025923
         Solomon Islands  2004 4.583240e+05     0.025863
        Papua New Guinea  2001 5.716152e+06     0.025830
                 Vanuatu  2003 1.989640e+05     0.025820
           Guinea-Bissau  2005 1.388380e+05     0.025808
                  Jordan  2004 5.535595e+06     0.025723
                   Tonga  2012 1.495100e+04     0.025657
                Pakistan  2006 1.579399e+07     0.025606
                  Greece  2008 1.177841e+06     0.025571
        Papua New Guinea  2002 5.862316e+06     0.025570
                   Gabon  2001 1.262259e+06     0.025292
         Solomon Islands  2005 4.698850e+05     0.025225
                   Tonga  2013 1.532800e+04     0.025216
              Mauritania  2009 3.562880e+05     0.025168
                 Iceland  2012 3.271600e+04     0.025130
                  Rwanda  2015 1.162955e+07     0.025050
                 Liberia  2013 4.286291e+06     0.025045
        Papua New Guinea  2006 6.472720e+05     0.025010
                   Haiti  2006 9.494570e+05     0.024951
                 Comoros  2001 5.558880e+05     0.024949
                  Rwanda  2002 8.536250e+05     0.024826
                 Vanuatu  2007 2.199530e+05     0.024782
               Guatemala  2007 1.372860e+05     0.024691
         Solomon Islands  2006 4.814220e+05     0.024553
                 Comoros  2002 5.694790e+05     0.024449
                 Comoros  2010 6.896920e+05     0.024419
           Guinea-Bissau  2004 1.353450e+05     0.024394
                 Comoros  2009 6.732520e+05     0.024380
   Sao Tome and Principe  2004 1.519690e+05     0.024243
                 Comoros  2006 6.264250e+05     0.024194
                 Comoros  2003 5.832110e+05     0.024113
                 Comoros  2005 6.116270e+05     0.024110
                   Sudan  2014 3.773791e+07     0.024098
   Sao Tome and Principe  2003 1.483720e+05     0.024039
                 Comoros  2004 5.972280e+05     0.024034
        Papua New Guinea  2008 6.787187e+06     0.024029
             Timor-Leste  2014 1.212814e+06     0.024020
                   Ghana  2010 2.451214e+06     0.023971
                  Israel  2001 6.439000e+03     0.023851
                 Comoros  2015 7.774240e+05     0.023755
                  Bhutan  2006 6.722280e+05     0.023741
                Zimbabwe  2015 1.577745e+07     0.023734
        Papua New Guinea  2009 6.947447e+06     0.023612
                 Vanuatu  2011 2.418710e+05     0.023598
           Guinea-Bissau  2007 1.445958e+06     0.023565
                   Ghana  2014 2.696256e+07     0.023393
                 Vanuatu  2012 2.474850e+05     0.023211
            Sierra Leone  2013 6.922790e+05     0.023154
                 Ecuador  2004 1.359647e+06     0.023090
   Sao Tome and Principe  2002 1.448890e+05     0.023068
                   Ghana  2015 2.758282e+07     0.023004
                 Vanuatu  2013 2.531420e+05     0.022858
              Tajikistan  2013 8.177890e+05     0.022796
                Honduras  2006 7.541460e+05     0.022789
                   Sudan  2011 3.516731e+07     0.022723
                  Guinea  2013 1.153662e+07     0.022616
              Seychelles  2015 9.341900e+04     0.022548
   Sao Tome and Principe  2015 1.955530e+05     0.022414
                 Vanuatu  2015 2.646300e+04     0.022330
              Tajikistan  2015 8.548651e+06     0.022230
               Guatemala  2010 1.463417e+06     0.022205
                Malaysia  2001 2.369897e+06     0.022138
         Solomon Islands  2012 5.515310e+05     0.022084
                Paraguay  2002 5.586110e+05     0.021929
                  Bhutan  2007 6.869580e+05     0.021912
                  Belize  2014 3.516940e+05     0.021829
                 Namibia  2012 2.263934e+06     0.021806
         Solomon Islands  2013 5.635130e+05     0.021725
Central African Republic  2006 4.217580e+05     0.021723
                  Belize  2015 3.592880e+05     0.021593
             Philippines  2001 7.966532e+07     0.021461
        Papua New Guinea  2014 7.755785e+06     0.021457
                Pakistan  2012 1.779115e+08     0.021398
                Pakistan  2013 1.817126e+08     0.021365
         Solomon Islands  2010 5.277900e+04     0.021285
              Azerbaijan  2008 8.763400e+04     0.021221
                Maldives  2012 3.850000e+02     0.021220
        Papua New Guinea  2015 7.919825e+06     0.021151
                  Brazil  2002 1.815121e+06     0.021127
                Pakistan  2014 1.855463e+08     0.021097
                Djibouti  2001 7.327110e+05     0.021080
              Tajikistan  2005 6.854176e+06     0.021054
               Guatemala  2014 1.592356e+07     0.020989
                Maldives  2001 2.920000e+02     0.020979
                   Sudan  2008 3.295550e+07     0.020846
                  Guinea  2006 9.881428e+06     0.020836
               Australia  2009 2.169170e+05     0.020824
                Maldives  2013 3.930000e+02     0.020779
              Tajikistan  2004 6.712841e+06     0.020673
                  Jordan  2003 5.396774e+06     0.020669
               Guatemala  2015 1.625243e+07     0.020653
                 Ireland  2008 4.489544e+06     0.020596
              Cabo Verde  2003 4.614700e+04     0.020590
                 Algeria  2013 3.833856e+07     0.020570
                 Ecuador  2007 1.425453e+06     0.020551
           Guinea-Bissau  2002 1.293523e+06     0.020521
                   Kenya  2005 3.648288e+06     0.020520
                Malaysia  2013 2.976724e+06     0.020315
                 Algeria  2012 3.756585e+07     0.020269
                 Algeria  2014 3.911331e+07     0.020208
              Tajikistan  2003 6.576877e+06     0.020036
                Malaysia  2009 2.765383e+06     0.019997
                 Vanuatu  2001 1.892900e+04     0.019717
                Malaysia  2004 2.517419e+06     0.019663
                  Rwanda  2005 8.991735e+06     0.019652
           Guinea-Bissau  2001 1.267512e+06     0.019532
                   China  2006 1.311200e+04    -0.019444
                 Algeria  2011 3.681956e+07     0.019434
                 Algeria  2015 3.987153e+07     0.019385
                 Liberia  2004 3.176414e+06     0.019312
                 Eritrea  2008 4.232636e+06     0.019094
                Mongolia  2012 2.814226e+06     0.019087
              Tajikistan  2002 6.447688e+06     0.019055
                Malaysia  2006 2.614357e+07     0.018869
             Philippines  2005 8.627424e+07     0.018845
                 Iceland  2008 3.174140e+05     0.018770
               Swaziland  2012 1.248158e+06     0.018690
            Turkmenistan  2013 5.366277e+06     0.018687
               Swaziland  2013 1.271456e+06     0.018666
                  Israel  2011 7.765800e+04     0.018653
            Turkmenistan  2014 5.466241e+06     0.018628
                Malaysia  2011 2.863513e+07     0.018598
              Bangladesh  2002 1.366667e+06     0.018596
                  Guinea  2002 9.137345e+06     0.018527
                  Bhutan  2010 7.276410e+05     0.018452
                Malaysia  2007 2.662584e+07     0.018447
                Djibouti  2002 7.462210e+05     0.018438
                  Israel  2010 7.623600e+04     0.018435
                Kiribati  2015 1.124700e+04    -0.018415
                   Nepal  2010 2.723137e+06     0.018333
                  Panama  2007 3.453870e+05     0.018255
                   Spain  2003 4.218764e+07     0.018249
                 Germany  2012 8.425823e+06     0.018228
                Honduras  2005 7.373430e+05     0.018214
              Costa Rica  2001 3.996798e+06     0.018178
                Mongolia  2015 2.976877e+06     0.018120
            Turkmenistan  2015 5.565284e+06     0.018119
                  Jordan  2002 5.287488e+06     0.018101
      Dominican Republic  2004 9.129980e+05     0.018089
                 Namibia  2001 1.933596e+06     0.018080
               Swaziland  2008 1.158897e+06     0.017975
                  Panama  2009 3.579385e+06     0.017950
                 Romania  2002 2.173496e+06    -0.017938
               Swaziland  2015 1.319110e+05     0.017855
                  Panama  2010 3.643222e+06     0.017835
                Kiribati  2003 8.889500e+04     0.017769
                 Ecuador  2001 1.285276e+07     0.017750
                Djibouti  2012 8.811850e+05     0.017609
                Djibouti  2013 8.966880e+05     0.017593
                   Nepal  2004 2.539449e+06     0.017561
                   Spain  2004 4.292190e+07     0.017404
                Djibouti  2011 8.659370e+05     0.017378
                  Panama  2013 3.838462e+06     0.017367
                Kiribati  2002 8.734300e+04     0.017296
                Cambodia  2003 1.285312e+07     0.017285
                Djibouti  2014 9.121640e+05     0.017259
                  Rwanda  2003 8.683460e+05     0.017245
                Botswana  2009 1.979882e+06     0.017228
                  Panama  2006 3.391950e+05     0.017183
                 Algeria  2009 3.546576e+06     0.017168
                Portugal  2004 1.483861e+06     0.017165
                Slovenia  2011 2.528430e+05     0.017137
                Maldives  2002 2.970000e+02     0.017123
              Uzbekistan  2008 2.732800e+04     0.017121
               Australia  2013 2.311735e+07     0.017120
                Slovenia  2012 2.571590e+05     0.017070
                   Spain  2006 4.439732e+07     0.017047
                   Spain  2005 4.365316e+07     0.017037
                 Ireland  2002 3.931947e+06     0.016994
                  Cyprus  2003 9.935630e+05     0.016988
                 Ecuador  2009 1.469128e+07     0.016869
                   Nepal  2002 2.456634e+07     0.016744
                Djibouti  2015 9.274140e+05     0.016718
                Botswana  2008 1.946351e+06     0.016682
                  Mexico  2009 1.155523e+07     0.016632
                Djibouti  2003 7.586150e+05     0.016609
                  Greece  2005 1.987314e+06     0.016456
                Cambodia  2012 1.477687e+07     0.016438
                 Ireland  2003 3.996521e+06     0.016423
                  Latvia  2009 2.141669e+06    -0.016375
                   Haiti  2002 8.834733e+06     0.016355
                  Turkey  2013 7.578733e+07     0.016327
              Luxembourg  2004 4.589500e+04     0.016208
                   Spain  2008 4.595416e+06     0.016082
              Luxembourg  2006 4.726370e+05     0.016078
                   Haiti  2003 8.976552e+06     0.016052
                Botswana  2007 1.914414e+06     0.016015
                   India  2005 1.144119e+09     0.015969
                 Ecuador  2012 1.541967e+07     0.015965
                  Bhutan  2013 7.649610e+05     0.015929
             Philippines  2007 8.929349e+06     0.015920
                   Haiti  2004 9.119178e+06     0.015889
      Dominican Republic  2001 8.697126e+06     0.015708
                 Ecuador  2013 1.566155e+07     0.015687
              Seychelles  2014 9.135900e+04     0.015676
            South Africa  2014 5.414673e+07     0.015658
                   India  2006 1.161978e+09     0.015609
              Luxembourg  2007 4.799930e+05     0.015564
      Dominican Republic  2002 8.832285e+06     0.015541
                Botswana  2006 1.884238e+06     0.015295
                Mongolia  2009 2.668289e+06     0.015280
                   India  2007 1.179681e+09     0.015236
                  Cyprus  2015 1.169850e+05     0.015151
                Paraguay  2006 5.882796e+06     0.015064
                  Bhutan  2014 7.764480e+05     0.015016
                Cambodia  2007 1.367669e+07     0.015006
                 Morocco  2015 3.483322e+06     0.014989
              Uzbekistan  2012 2.977450e+05     0.014830
                 Namibia  2002 1.962147e+06     0.014766
              Cabo Verde  2005 4.745670e+05     0.014761
              Kazakhstan  2015 1.754413e+07     0.014743
                 Morocco  2013 3.382477e+07     0.014729
                 Algeria  2006 3.377792e+07     0.014704
                Slovenia  2008 2.213160e+05     0.014643
              Bangladesh  2005 1.434311e+07     0.014544
      Dominican Republic  2006 9.371338e+06     0.014481
                 Morocco  2012 3.333379e+07     0.014455
                  Canada  2002 3.136200e+04    -0.014362
                   India  2009 1.214271e+08     0.014304
                Botswana  2002 1.779953e+06     0.014256
                Paraguay  2007 5.966159e+06     0.014171
                 Ukraine  2010 4.587700e+04    -0.014098
                  Bhutan  2015 7.873860e+05     0.014087
               Indonesia  2002 2.175859e+06     0.014078
                Colombia  2003 4.215215e+07     0.013943
                  Turkey  2004 6.778550e+05     0.013868
                  Turkey  2010 7.232691e+07     0.013846
               Indonesia  2006 2.298382e+07     0.013786
            South Africa  2008 4.955757e+07     0.013782
                 Algeria  2003 3.243514e+06     0.013742
                 Finland  2002 5.259800e+04     0.013683
               Indonesia  2008 2.361593e+08     0.013606
                Portugal  2007 1.542964e+06     0.013582
               Australia  2001 1.941300e+04     0.013575
                Paraguay  2012 6.379219e+06     0.013575
                Colombia  2004 4.272416e+07     0.013570
                 Lesotho  2015 2.174645e+06     0.013450
                 Lesotho  2014 2.145785e+06     0.013424
                    Peru  2001 2.626136e+07     0.013370
      Dominican Republic  2010 9.897985e+06     0.013332
               Argentina  2003 3.839379e+06     0.013313
                 Albania  2011 2.951950e+05     0.013298
                Paraguay  2015 6.639119e+06     0.013206
                Honduras  2004 7.241530e+05    -0.013175
                   India  2012 1.263659e+08     0.013167
                    Peru  2011 2.975999e+07     0.013153
                Colombia  2005 4.328563e+07     0.013142
                Mongolia  2007 2.591670e+05     0.013115
               Nicaragua  2008 5.594560e+05     0.013111
                 Georgia  2010 3.926000e+03    -0.013072
                 Georgia  2011 3.875000e+03    -0.012990
                 Georgia  2014 3.727000e+03    -0.012977
                  Guyana  2006 7.496100e+04    -0.012970
                 Ireland  2015 4.676835e+06     0.012910
                 Georgia  2012 3.825000e+03    -0.012903
               Nicaragua  2007 5.522160e+05     0.012848
                  Latvia  2001 2.337170e+05    -0.012832
                 Georgia  2013 3.776000e+03    -0.012810
             Switzerland  2008 7.647675e+06     0.012787
                Colombia  2006 4.383572e+07     0.012708
                  Norway  2009 4.828726e+06     0.012691
              Azerbaijan  2014 9.535790e+05     0.012635
             Switzerland  2009 7.743831e+06     0.012573
               Nicaragua  2010 5.737723e+06     0.012555
                  Norway  2010 4.889252e+06     0.012535
                    Peru  2004 2.727319e+07     0.012453
                 Morocco  2009 3.198990e+07     0.012439
                 Namibia  2003 1.986535e+06     0.012429
               Australia  2003 1.989540e+05     0.012416
                  Brazil  2004 1.847385e+08     0.012365
              Uzbekistan  2002 2.527185e+06     0.012314
              Cabo Verde  2015 5.329130e+05     0.012302
                Colombia  2007 4.437457e+07     0.012292
            South Africa  2003 4.641819e+07     0.012271
                    Peru  2007 2.829272e+07     0.012264
                 Algeria  2004 3.283196e+06     0.012234
                  Canada  2012 3.475545e+06     0.012016
                  Uganda  2012 3.636796e+06     0.012007
            South Africa  2002 4.585548e+07     0.011973
                   Chile  2001 1.544497e+07     0.011939
                Suriname  2002 4.834400e+04     0.011931
                Colombia  2011 4.646646e+06     0.011923
                 Germany  2011 8.274983e+06     0.011897
                 Morocco  2008 3.159686e+07     0.011880
               Indonesia  2015 2.581621e+08     0.011880
                Mongolia  2005 2.526446e+06     0.011861
                   Nepal  2006 2.594618e+06     0.011828
                   Nepal  2012 2.764992e+07     0.011812
                  Brazil  2005 1.869174e+08     0.011795
                   India  2013 1.278562e+08     0.011794
                Zimbabwe  2001 1.236616e+07     0.011775
                   Nepal  2015 2.865628e+07     0.011759
              Uzbekistan  2003 2.556765e+06     0.011705
              Uzbekistan  2004 2.586435e+06     0.011605
                   Chile  2002 1.562364e+07     0.011568
                 Morocco  2001 2.918183e+07     0.011515
                  Canada  2009 3.362857e+07     0.011514
             Switzerland  2015 8.282396e+06     0.011448
                Zimbabwe  2004 1.277751e+07     0.011367
                  Brazil  2007 1.912664e+07     0.011327
                 Morocco  2002 2.951237e+07     0.011327
                Suriname  2001 4.777400e+04     0.011325
                   Chile  2003 1.579954e+07     0.011259
              Seychelles  2003 8.278100e+04    -0.011251
                 Morocco  2003 2.984394e+07     0.011235
                 Iceland  2014 3.273860e+05     0.011187
                Mongolia  2004 2.496832e+06     0.011155
               Argentina  2002 3.788937e+06     0.011149
                  Canada  2014 3.554456e+07     0.011068
              Azerbaijan  2006 8.484550e+05     0.011046
               Lithuania  2009 3.162916e+06    -0.011042
                   Chile  2004 1.597378e+07     0.011028
                  Latvia  2002 2.311730e+05    -0.010885
                  Canada  2008 3.324577e+07     0.010881
                Slovenia  2013 2.599530e+05     0.010865
              Cabo Verde  2008 4.917230e+05     0.010865
                  Latvia  2004 2.263122e+06    -0.010854
               Argentina  2005 3.914549e+07     0.010762
                  Latvia  2005 2.238799e+06    -0.010748
                 Austria  2015 8.633169e+06     0.010723
             Switzerland  2012 7.996861e+06     0.010675
              Cabo Verde  2009 4.969630e+05     0.010656
                   Malta  2015 4.318740e+05     0.010553
                   Chile  2007 1.649169e+07     0.010533
              Luxembourg  2002 4.461750e+05     0.010532
               Argentina  2011 4.165688e+07     0.010503
                Mongolia  2003 2.469286e+06     0.010487
               Argentina  2007 3.997224e+06     0.010449
               Argentina  2014 4.298152e+07     0.010381
                   Chile  2008 1.666194e+07     0.010324
               Lithuania  2008 3.198231e+06    -0.010232
                 Ireland  2009 4.535375e+06     0.010208
                Paraguay  2004 5.737400e+04     0.010195
               Argentina  2015 4.341776e+07     0.010150
                  Canada  2004 3.199500e+04     0.010071
               Lithuania  2013 2.957689e+06    -0.010069
                   Chile  2009 1.682944e+07     0.010053
                  Canada  2003 3.167600e+04     0.010012
                 Ukraine  2001 4.868386e+07    -0.010005
                  Guinea  2004 9.492290e+05     0.009981
                   Nepal  2008 2.647586e+07     0.009957
               Lithuania  2002 3.443670e+05    -0.009922
                  Canada  2005 3.231200e+04     0.009908
                Mongolia  2002 2.443659e+06     0.009870
                Portugal  2008 1.558177e+06     0.009860
                Colombia  2013 4.734298e+07     0.009844
                   Nepal  2005 2.564287e+06     0.009781
                  Brazil  2010 1.967963e+08     0.009750
      Dominican Republic  2008 9.636520e+05     0.009744
                   Chile  2010 1.699335e+07     0.009740
                  Brazil  2011 1.986867e+08     0.009606
                   China  2012 1.356950e+05     0.009538
                Colombia  2014 4.779191e+07     0.009483
                   Malta  2014 4.273640e+05     0.009424
                   Chile  2011 1.715336e+07     0.009416
                   Malta  2013 4.233740e+05     0.009343
                Mongolia  2001 2.419776e+06     0.009318
            South Africa  2001 4.531294e+07     0.009267
                 Myanmar  2014 5.192418e+07     0.009252
                 Lesotho  2001 1.885955e+06     0.009234
                    Fiji  2011 8.678600e+04     0.009198
                 Denmark  2004 5.445230e+05     0.009172
                Colombia  2015 4.822870e+07     0.009139
                  Latvia  2006 2.218357e+06    -0.009131
             Switzerland  2007 7.551117e+06     0.008977
                 Iceland  2002 2.875230e+05     0.008966
                  Brazil  2008 1.929793e+07     0.008956
                    Fiji  2007 8.348120e+05     0.008945
                Suriname  2013 5.425400e+04     0.008870
                   Chile  2014 1.761380e+07     0.008636
                  Sweden  2010 9.378126e+06     0.008562
                Honduras  2015 8.968290e+05     0.008561
               Lithuania  2014 2.932367e+06    -0.008561
                  Sweden  2009 9.298515e+06     0.008555
                 Lesotho  2007 1.982287e+06     0.008458
                   Chile  2015 1.776268e+07     0.008453
     Trinidad and Tobago  2004 1.295350e+05     0.008431
                 Lesotho  2006 1.965662e+06     0.008268
                 Belarus  2007 9.569530e+05    -0.008261
                 Lesotho  2005 1.949543e+06     0.008179
                  Latvia  2015 1.977527e+06    -0.008153
                   Samoa  2012 1.891940e+05     0.008147
               Mauritius  2001 1.196287e+06     0.007932
                Bulgaria  2003 7.775327e+06    -0.007890
                   Italy  2015 6.735820e+05    -0.007854
                   Malta  2012 4.194550e+05     0.007656
                 Tunisia  2003 9.939678e+06     0.007639
                Botswana  2004 1.829330e+05    -0.007627
               Sri Lanka  2009 1.996800e+04     0.007620
               Sri Lanka  2001 1.879700e+04     0.007612
             Switzerland  2002 7.284753e+06     0.007593
                  Sweden  2011 9.449213e+06     0.007580
               Sri Lanka  2008 1.981700e+04     0.007576
               Sri Lanka  2002 1.893900e+04     0.007554
               Sri Lanka  2005 1.937300e+04     0.007541
                 Belarus  2010 9.495830e+05    -0.007507
                 Myanmar  2006 4.884647e+07     0.007505
                  Sweden  2012 9.519374e+06     0.007425
                 Austria  2014 8.541575e+06     0.007335
                  Poland  2010 3.842794e+06     0.007242
              Luxembourg  2013 5.433600e+04     0.007229
                 Georgia  2001 4.386400e+04    -0.007220
                 Ukraine  2008 4.625820e+05    -0.007196
                 Denmark  2015 5.683483e+06     0.007089
                   Samoa  2009 1.848260e+05     0.007083
                    Fiji  2013 8.797150e+05     0.007004
                  France  2006 6.362138e+07     0.006996
                Bulgaria  2008 7.492561e+06    -0.006995
                 Belarus  2003 9.796749e+06    -0.006974
                 Iceland  2003 2.895210e+05     0.006949
                 Austria  2005 8.227829e+06     0.006836
                    Fiji  2006 8.274110e+05     0.006807
                   Samoa  2008 1.835260e+05     0.006802
                 Albania  2009 2.927519e+06    -0.006716
                  Guyana  2015 7.685140e+05     0.006708
                 Myanmar  2007 4.917159e+07     0.006656
                 Estonia  2008 1.337900e+04    -0.006608
                   Malta  2003 3.985820e+05     0.006599
                Bulgaria  2010 7.395599e+06    -0.006561
                  Guyana  2013 7.588100e+04     0.006499
                  Malawi  2015 1.757367e+06    -0.006485
                Portugal  2009 1.568247e+06     0.006463
             Switzerland  2005 7.437115e+06     0.006427
                Bulgaria  2009 7.444443e+06    -0.006422
                   Samoa  2005 1.799290e+05     0.006421
                Bulgaria  2011 7.348328e+06    -0.006392
                Bulgaria  2015 7.177991e+06    -0.006360
                 Estonia  2001 1.388115e+06    -0.006349
                 Belarus  2002 9.865548e+06    -0.006345
                 Jamaica  2003 2.712511e+06     0.006331
                 Ukraine  2005 4.715150e+05    -0.006324
                   Samoa  2004 1.787810e+05     0.006298
             Switzerland  2006 7.483934e+06     0.006295
                 Myanmar  2008 4.947975e+07     0.006267
                 Austria  2004 8.171966e+06     0.006223
Central African Republic  2011 4.476153e+06     0.006211
                 Romania  2005 2.131968e+07    -0.006156
                   Samoa  2003 1.776620e+05     0.006116
                   Tonga  2003 9.978900e+04     0.006100
                 Jamaica  2004 2.728777e+06     0.005997
                 Denmark  2008 5.493621e+06     0.005893
                  Norway  2003 4.564855e+06     0.005883
                 Austria  2013 8.479375e+06     0.005858
                 Jamaica  2005 2.744673e+06     0.005825
                   Tonga  2002 9.918400e+04     0.005811
                   Samoa  2002 1.765820e+05     0.005787
                 Romania  2004 2.145175e+07    -0.005682
                Bulgaria  2014 7.223938e+06    -0.005668
                Thailand  2007 6.619562e+07     0.005643
                  Norway  2002 4.538159e+06     0.005407
                  Poland  2011 3.863255e+06     0.005325
               Mauritius  2015 1.262650e+05    -0.005270
                  France  2013 6.599857e+06     0.005160
                   China  2008 1.324655e+06     0.005137
     Trinidad and Tobago  2012 1.341588e+06     0.005094
                   China  2015 1.371220e+05     0.005094
                 Denmark  2014 5.643475e+06     0.005083
                   China  2014 1.364270e+05     0.005076
             El Salvador  2015 6.312478e+06     0.004981
     Trinidad and Tobago  2013 1.348248e+06     0.004964
                 Austria  2006 8.268641e+06     0.004960
                   Italy  2005 5.796948e+07     0.004926
                 Belgium  2013 1.118282e+07     0.004904
                   China  2010 1.337750e+05     0.004875
                  Serbia  2012 7.199770e+05    -0.004868
                 Finland  2009 5.338871e+06     0.004794
                 Finland  2012 5.413971e+06     0.004769
                   China  2011 1.344130e+05     0.004769
     Trinidad and Tobago  2009 1.321618e+06     0.004748
               Mauritius  2006 1.233996e+06     0.004675
                 Finland  2011 5.388272e+06     0.004646
              Seychelles  2005 8.285800e+04     0.004644
     Trinidad and Tobago  2014 1.354493e+06     0.004632
                 Finland  2013 5.438972e+06     0.004618
                   Spain  2010 4.657690e+07     0.004615
                 Finland  2010 5.363352e+06     0.004585
                 Austria  2012 8.429991e+06     0.004570
  Bosnia and Herzegovina  2009 3.746561e+06    -0.004527
             El Salvador  2010 6.164626e+06     0.004456
                 Denmark  2007 5.461438e+06     0.004445
                    Fiji  2005 8.218170e+05     0.004232
                 Denmark  2013 5.614932e+06     0.004178
                 Finland  2014 5.461512e+06     0.004144
                 Ireland  2014 4.617225e+06     0.004117
                 Croatia  2014 4.238389e+06    -0.004065
                  Serbia  2007 7.381579e+06    -0.004046
                  Sweden  2004 8.993531e+06     0.003941
              Seychelles  2009 8.729800e+04     0.003933
                 Armenia  2010 2.877311e+06    -0.003903
             Netherlands  2008 1.644559e+07     0.003901
                 Jamaica  2014 2.862870e+05     0.003857
                  Sweden  2003 8.958229e+06     0.003728
              Seychelles  2004 8.247500e+04    -0.003697
                    Fiji  2001 8.142180e+05     0.003692
                Thailand  2011 6.753130e+05     0.003604
                 Denmark  2001 5.358783e+06     0.003590
                 Estonia  2012 1.322696e+06    -0.003573
                   Spain  2011 4.674270e+07     0.003560
                 Estonia  2013 1.317997e+06    -0.003553
                 Uruguay  2015 3.431552e+06     0.003511
Central African Republic  2014 4.515392e+06     0.003498
                 Uruguay  2010 3.374415e+06     0.003467
                 Iceland  2009 3.184990e+05     0.003418
                  Greece  2003 1.928700e+04     0.003382
                  Belize  2007 2.984700e+04     0.003362
                 Uruguay  2011 3.385624e+06     0.003322
                  France  2008 6.437499e+06     0.003315
                 Finland  2015 5.479531e+06     0.003299
                 Uruguay  2012 3.396777e+06     0.003294
                 Denmark  2006 5.437272e+06     0.003292
                 Austria  2007 8.295487e+06     0.003247
                 Denmark  2002 5.375931e+06     0.003200
                 Austria  2008 8.321496e+06     0.003135
                 Estonia  2011 1.327439e+06    -0.003031
                   Italy  2006 5.814398e+07     0.003010
                 Hungary  2013 9.893820e+05    -0.003003
              Montenegro  2012 6.261000e+03    -0.002867
                 Croatia  2013 4.255689e+06    -0.002781
               Argentina  2001 3.747159e+06    -0.002739
                   Italy  2012 5.953972e+07     0.002699
                  Sweden  2001 8.895960e+05     0.002679
               Mauritius  2009 1.247429e+06     0.002659
                 Austria  2009 8.343323e+06     0.002623
                 Estonia  2014 1.314545e+06    -0.002619
                 Ukraine  2011 4.576100e+04    -0.002529
                 Ireland  2013 4.598294e+06     0.002485
      Russian Federation  2013 1.435691e+07     0.002460
                   Japan  2001 1.271490e+05     0.002412
                 Germany  2009 8.192370e+05    -0.002387
                 Albania  2014 2.889140e+05    -0.002341
             Netherlands  2005 1.631987e+07     0.002339
                   Japan  2002 1.274450e+05     0.002328
                 Iceland  2011 3.191400e+04     0.002293
                 Estonia  2010 1.331475e+06    -0.002278
                 Ukraine  2013 4.548960e+05    -0.002274
                 Ireland  2012 4.586897e+06     0.002207
               Mauritius  2013 1.258653e+06     0.002206
                 Armenia  2012 2.881922e+06     0.002205
                  Turkey  2005 6.793460e+05     0.002200
                 Lesotho  2003 1.918970e+05    -0.002158
              Montenegro  2009 6.182940e+05     0.002148
                   Japan  2003 1.277180e+05     0.002142
                    Fiji  2004 8.183540e+05     0.002114
                  Guyana  2008 7.463140e+05    -0.002079
                Pakistan  2008 1.636446e+07     0.001928
              Montenegro  2010 6.194280e+05     0.001834
                    Fiji  2002 8.156910e+05     0.001809
              Montenegro  2008 6.169690e+05     0.001776
              Montenegro  2004 6.133530e+05     0.001774
      Russian Federation  2012 1.432168e+07     0.001735
                   Italy  2011 5.937945e+07     0.001721
              Kazakhstan  2001 1.485834e+07    -0.001699
                 Germany  2002 8.248850e+07     0.001683
                   Japan  2012 1.276290e+05    -0.001596
                 Belarus  2015 9.489616e+06     0.001594
              Montenegro  2005 6.142610e+05     0.001480
                   Japan  2013 1.274450e+05    -0.001442
                 Germany  2007 8.226637e+07    -0.001336
                   Japan  2014 1.272760e+05    -0.001326
              Mauritania  2001 2.797290e+05     0.001324
                   Japan  2009 1.284700e+04    -0.001244
                  Guyana  2010 7.465560e+05     0.001157
              Kazakhstan  2005 1.514729e+06     0.001153
                    Fiji  2003 8.166280e+05     0.001149
                 Germany  2006 8.237645e+07    -0.001127
               Australia  2011 2.234240e+05     0.001116
                 Croatia  2009 4.429780e+05    -0.001082
                   Japan  2015 1.271410e+05    -0.001061
                 Belarus  2012 9.464495e+06    -0.000916
                 Belarus  2014 9.474511e+06     0.000899
              Montenegro  2014 6.218100e+04     0.000869
                  Guyana  2009 7.456930e+05    -0.000832
                 Jamaica  2013 2.851870e+05     0.000684
                 Croatia  2005 4.442000e+03     0.000676
                 Uruguay  2003 3.325637e+06    -0.000642
                   Japan  2006 1.278540e+05     0.000634
                  Poland  2006 3.814127e+07    -0.000634
                 Armenia  2011 2.875581e+06    -0.000601
                 Germany  2003 8.253418e+07     0.000554
  Bosnia and Herzegovina  2004 3.781287e+06     0.000540
                  Guyana  2002 7.518840e+05    -0.000504
      Russian Federation  2010 1.428494e+08     0.000449
                  Poland  2005 3.816544e+07    -0.000439
                  Poland  2002 3.823364e+06    -0.000395
                   Japan  2004 1.277610e+05     0.000337
                   China  2013 1.357380e+05     0.000317
                  Guyana  2004 7.516520e+05    -0.000273
                 Belarus  2013 9.465997e+06     0.000159
                 Tunisia  2014 1.114398e+06    -0.000144
                Suriname  2010 5.261300e+04    -0.000114
                   Japan  2005 1.277730e+05     0.000094
              Kazakhstan  2002 1.485895e+07     0.000041
                  Guyana  2003 7.518570e+05    -0.000036
                  Poland  2012 3.863164e+06    -0.000024
                 Croatia  2002 4.440000e+02     0.000000
                 Croatia  2003 4.440000e+02     0.000000
In [34]:
df = dataframe.sort_values(["Country","Year"]).copy()
df["Population"] = pd.to_numeric(df["Population"])
df.loc[df["Population"] <= 0, "Population"] = np.nan

prev = df.groupby("Country")["Population"].shift(1)
ratio = df["Population"] / prev

spike = df["Population"].notna() & prev.notna() & ((ratio < 0.7) | (ratio > 1.3))

bad_countries = df.loc[spike, "Country"].unique().tolist()
print("Broj država sa spike-ovima:", len(bad_countries))
print("Primer:", bad_countries[:30])
Broj država sa spike-ovima: 143
Primer: ['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bangladesh', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia']
In [35]:
dataframe.loc[dataframe["Country"].isin(bad_countries), "Population"] = np.nan
In [36]:
pop = pd.to_numeric(dataframe["Population"])

missing_count = pop.isna().sum()
missing_percent = pop.isna().mean() * 100

print("Nedostajuci redovi:", missing_count)
print("Nedostajuci % feature-a Population", round(missing_percent,2), "%")
Nedostajuci redovi: 2936
Nedostajuci % feature-a Population 99.93 %
In [37]:
url = "https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL?date=2000:2015&format=json&per_page=20000"

r = requests.get(url)
data = r.json()[1]

pop = pd.DataFrame([{
    "Country": d["country"]["value"],
    "Year": int(d["date"]),
    "PopulationWB": d["value"]
} for d in data if d["value"] is not None])

print(pop.head())
                       Country  Year  PopulationWB
0  Africa Eastern and Southern  2015     607123269
1  Africa Eastern and Southern  2014     590968990
2  Africa Eastern and Southern  2013     575202699
3  Africa Eastern and Southern  2012     559609961
4  Africa Eastern and Southern  2011     544737983
In [38]:
your = set(dataframe["Country"].unique())
worldBankData = set(pop["Country"].unique())
print(your - worldBankData)
{'Bahamas', 'Republic of Korea', "Côte d'Ivoire", 'Gambia', 'Saint Kitts and Nevis', 'Iran (Islamic Republic of)', 'Saint Lucia', 'Slovakia', 'Democratic Republic of the Congo', 'Egypt', 'The former Yugoslav republic of Macedonia', 'Congo', 'Micronesia (Federated States of)', 'Niue', "Democratic People's Republic of Korea", 'United States of America', "Lao People's Democratic Republic", 'Turkey', 'Venezuela (Bolivarian Republic of)', 'United Kingdom of Great Britain and Northern Ireland', 'Yemen', 'Kyrgyzstan', 'United Republic of Tanzania', 'Republic of Moldova', 'Somalia', 'Swaziland', 'Bolivia (Plurinational State of)', 'Cook Islands', 'Saint Vincent and the Grenadines'}
In [39]:
name_map = {
    "Bahamas": "Bahamas, The",
    "Bolivia (Plurinational State of)": "Bolivia",
    "Côte d'Ivoire": "Cote d'Ivoire",
    "Congo": "Congo, Rep.",
    "Democratic Republic of the Congo": "Congo, Dem. Rep.",
    "Democratic People's Republic of Korea": "Korea, Dem. People's Rep.",
    "Egypt": "Egypt, Arab Rep.",
    "Iran (Islamic Republic of)": "Iran, Islamic Rep.",
    "Gambia": "Gambia, The",
    "Kyrgyzstan": "Kyrgyz Republic",
    "Lao People's Democratic Republic": "Lao PDR",
    "United Republic of Tanzania": "Tanzania",
    "Micronesia (Federated States of)": "Micronesia, Fed. Sts.",
    "Republic of Korea": "Korea, Rep.",
    "Republic of Moldova": "Moldova",
    "Saint Vincent and the Grenadines": "St. Vincent and the Grenadines",
    "Saint Lucia": "St. Lucia",
    "Slovakia": "Slovak Republic",
    "Venezuela (Bolivarian Republic of)": "Venezuela, RB",
    "United States of America": "United States",
    "The former Yugoslav republic of Macedonia": "North Macedonia",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "Yemen": "Yemen, Rep.",
    "Saint Kitts and Nevis": "St. Kitts and Nevis",
    "Swaziland": "Eswatini",
    "Turkey": "Turkiye"
}
In [40]:
df = dataframe.copy()
df["Year"] = pd.to_numeric(df["Year"]).astype(int)
df["Country_wb"] = df["Country"].str.strip().replace(name_map)
copyDataframe = pop.copy()
copyDataframe["Year"] = pd.to_numeric(copyDataframe["Year"]).astype(int)

merged = df.merge(
    copyDataframe.rename(columns={"Country": "Country_wb"}),
    on=["Country_wb", "Year"],
    how="left"
)

merged["Population"] = pd.to_numeric(merged["Population"], errors="coerce")
merged["Population"] = merged["Population"].fillna(merged["PopulationWB"])

merged.loc[merged["Population"] < 10_000, "Population"] = merged["PopulationWB"]

merged.drop(columns=["PopulationWB"], inplace=True)
dataframe = merged
In [41]:
print("Remaining missing Population:", dataframe["Population"].isna().sum())
Remaining missing Population: 18
In [42]:
print(pop[pop["Country"]=="Somalia"][["Country","Year","PopulationWB"]].sort_values("Year").to_string(index=False))
Empty DataFrame
Columns: [Country, Year, PopulationWB]
Index: []
In [43]:
all_missing = (
    dataframe.groupby("Country")["Population"]
    .apply(lambda s: s.isna().mean())
    .loc[lambda x: x == 1.0]
    .index
)

print("Countries with 100% missing Population:", len(all_missing))
print(all_missing.tolist())
Countries with 100% missing Population: 3
['Cook Islands', 'Niue', 'Somalia']
In [44]:
dataframe = dataframe[~dataframe["Country"].isin(all_missing)].copy()

Proverili smo “Population” po državama kroz godine i tražili ekstremne skokove u odnosu na prethodnu godinu (ratio < 0.7 ili > 1.3). Za populaciju takve promene nisu realne (država ne može da poraste ili padne 30% u jednoj godini bez nekog totalno posebnog slučaja), pa je to jak signal da su podaci u ovoj koloni korumpirani.

Zbog toga smo odlucili da za države koje imaju ovakve spike-ove postavimo Population na NaN za sve godine i popunim je iz pouzdanijeg izvora, World Bank dataset. Ovo nam deluje kao mnogo čistije rešenje nego da pokušavamo da nagađamo ispravnu skalu podataka ili da popunjavamo populaciju koristeći mean/median iz drugih država.

Posle merge-a sa World Bank populacijom, većina vrednosti je uspešno popunjena; za par država (3) nije bilo dostupnih podataka u tom izvoru za traženi period, pa smo te drzave drop-ovali.

GDP

In [68]:
dataframe["GDP"] = pd.to_numeric(dataframe["GDP"])
dataframe.loc[dataframe["GDP"] < 0, "GDP"] = np.nan
dataframe["GDP"].describe()
Out[68]:
count      2490.000000
mean       7483.158469
std       14270.169342
min           1.681350
25%         463.935626
50%        1766.947595
75%        5910.806335
max      119172.741800
Name: GDP, dtype: float64

Distribucija GDP-a je jako asimetrična. Većina država ima relativno niže vrednosti GDP-a, dok mali broj država ima veoma visoke vrednosti, što se vidi iz velikog maksimuma i velike standardne devijacije.

Medijana je dosta manja od srednje vrednosti, što takođe ukazuje na to da nekoliko veoma bogatih država “vuče” prosečnu vrednost naviše.

In [69]:
dataframe["GDP_diff"] = dataframe.groupby("Country")["GDP"].diff().abs()
dataframe.sort_values("GDP_diff", ascending=False)[["Country","Year","GDP","GDP_diff"]].head(20)
Out[69]:
Country Year GDP GDP_diff
1539 Luxembourg 2014 119172.74180 117972.91950
1546 Luxembourg 2007 1618.49280 112675.35050
1545 Luxembourg 2008 114293.84330 101095.17400
1543 Luxembourg 2010 14965.36100 100796.21600
1542 Luxembourg 2011 115761.57700 99012.44100
1541 Luxembourg 2012 16749.13600 97002.71400
1547 Luxembourg 2006 89739.71170 88121.21890
1916 Norway 2009 817.77681 86828.97665
1915 Norway 2010 87646.75346 86071.76736
2076 Qatar 2010 736.22784 85212.51816
2079 Qatar 2007 675.61258 82291.75970
1548 Luxembourg 2005 8289.69641 81450.01529
2074 Qatar 2012 88564.82298 79729.94340
2073 Qatar 2013 8834.87958 78017.83232
2522 Switzerland 2014 85814.58857 76824.74617
1918 Norway 2007 85128.65759 75440.06149
1549 Luxembourg 2004 75716.35180 67426.65539
1179 Iceland 2006 5613.54115 62734.77702
1921 Norway 2004 5757.26916 61018.12524
2077 Qatar 2009 61478.23813 60742.01029

Ovde je izračunata apsolutna razlika GDP-a između uzastopnih godina za svaku državu (GDP_diff) da vidim0 gde su najveće promene. U outputu se vide ekstremno veliki skokovi, posebno za države kao što su Luxembourg, Norway i Qatar.

In [70]:
gdp_suspicious = dataframe[(dataframe['Status'] == 'Developed') & (dataframe['GDP'] < 500)][['Country','Year','GDP']]
print(gdp_suspicious.to_string())
          Country  Year         GDP
125     Australia  2002  281.817630
137       Austria  2006  443.993610
396      Bulgaria  2003  271.468240
397      Bulgaria  2002  287.534843
399      Bulgaria  2000  169.285860
1174      Iceland  2011   46.217000
1282        Italy  2015  349.147550
1289        Italy  2008  464.184650
1296        Italy  2001   24.819000
1297        Italy  2000  251.242600
1536    Lithuania  2001  353.147337
1851  New Zealand  2009  282.941930
2046       Poland  2008  141.446880
2048       Poland  2006   94.772600
2120      Romania  2014   12.277330
2123      Romania  2011   92.277825
2346     Slovenia  2014  242.672860
2426        Spain  2014  296.472250

Iz ispisa se vidi da neke razvijene države imaju veoma nizak GDP (npr. Italy $24 u 2001, Romania $92), što očigledno nema smisla. Ovo najverovatnije ukazuje na grešku u jedinicama ili skali podataka. Verovatno su u nekim redovima pomešani različiti izvori ili metrike (npr. GDP per capita vs ukupni GDP), pa su neke vrednosti pogrešno upisane ili skalirane. Ove vrednosti cemo tretirati kao greške u podacima.

In [80]:
countries = ["Luxembourg","Norway","Qatar","Belgium","New Zealand"]
tmp = dataframe[dataframe["Country"].isin(countries)].sort_values(["Country","Year"])

for country in countries:
    g = tmp[tmp["Country"] == country]

    plt.figure(figsize=(7,3))
    plt.plot(g["Year"], g["GDP"], marker="o")
    plt.title(country)
    plt.xlabel("Godina")
    plt.ylabel("GDP")
    plt.grid(True, alpha=0.3)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Na ovim grafovima se vide nagli padovi i skokovi GDP-a iz godine u godinu (npr. sa ~80k na ~800 pa ponovo nazad). Takve promene nemaju smisla u domenskom smislu, jer GDP per capita obično menja vrednost postepeno kroz vreme, a ne da se promeni desetine ili stotine puta u jednoj godini.

Posebno je sumnjivo što se ovakve promene pojavljuju kod razvijenih i bogatih država kao što su Luxembourg, Norway i Qatar, gde su ekonomske promene obično relativno stabilne.

Ovi grafovi zapravo vizuelno potvrđuju ono što smo već videli u GDP_diff tabeli — najveće razlike dolaze iz nekonzistentnih ili pogrešno skaliranih vrednosti u datasetu.

In [83]:
miss_year = dataframe.groupby("Year")["GDP"].apply(lambda x: x.isna().mean())

plt.figure(figsize=(7,3))
plt.plot(miss_year.index, miss_year*100, marker="o")
plt.title("GDP missing po godini (%)")
plt.xlabel("Godina")
plt.ylabel("Missing (%)")
plt.grid(True, alpha=0.3)
plt.show()
No description has been provided for this image

Procenat nedostajućih GDP vrednosti je oko 39–41% u prvim godinama (2000–2004), dok je u nekim kasnijim godinama nešto veći, oko 46–55%. Ipak, ne vidi se jasan trend da starije godine imaju više missing podataka, jer npr. 2015 opet pada na oko 40%.

Zbog toga izgleda da nedostajanje GDP vrednosti nije prvenstveno povezano sa godinom, već više sa samim državama ili izvorom podataka. Drugim rečima, deluje da neke države kroz više godina sistematski nemaju GDP podatke.

In [84]:
miss_country = dataframe.groupby("Country")["GDP"].apply(lambda x: x.isna().mean()).sort_values(ascending=False)

top20 = miss_country.head(20).sort_values()

plt.figure(figsize=(8,6))
plt.barh(top20.index, top20.values * 100)
plt.title("Procenat nedostajucih vrednosti")
plt.xlabel("Missing (%)")
plt.ylabel("Drzave")
plt.grid(True, axis="x", alpha=0.3)
plt.show()
No description has been provided for this image

Za neke države GDP nedostaje u 100% redova kroz sve godine. U tom slučaju nemamo nijednu poznatu vrednost za tu državu, pa interpolacija ili imputacija pomoću mediane po državi nije moguća.

In [85]:
miss_status = dataframe.groupby("Status")["GDP"].apply(lambda s: s.isna().mean())

plt.figure()
plt.bar(miss_status.index.astype(str), miss_status.values*100)
plt.title("GDP missingness by Status (%)")
plt.xlabel("Status"); plt.ylabel("Missing %")
plt.show()
No description has been provided for this image

Ovde gledam procenat nedostajućih GDP vrednosti u odnosu na status države. Razlika postoji, ali nije velika — oko 12% za developed i 16% za developing zemlje.

Zbog toga mi ne deluje da missing GDP direktno zavisi od statusa države. Obe grupe imaju sličan procenat nedostajućih vrednosti, pa je verovatnije da problem dolazi iz načina na koji je GDP prikupljan u datasetu, a ne iz toga da li je država razvijena ili u razvoju.

In [106]:
tab = pd.crosstab(dataframe["Status"], dataframe["GDP"])
print((tab.div(tab.sum(axis=1), axis=0)*100).round(2))

chi2, p_chi, dof, exp = stats.chi2_contingency(tab)
print("Chi-square p-value:", p_chi)
GDP         1.681350       3.685949       4.613575       5.668726       \
Status                                                                   
Developed            0.00           0.00           0.00           0.00   
Developing           0.04           0.04           0.04           0.04   

GDP         8.376432       11.147277      11.336780      11.553196      \
Status                                                                   
Developed            0.00           0.00           0.00           0.00   
Developing           0.04           0.04           0.04           0.04   

GDP         11.631377      12.178928      ...  85948.746000   86852.711900   \
Status                                    ...                                 
Developed            0.00           0.00  ...           0.00           0.00   
Developing           0.04           0.04  ...           0.04           0.04   

GDP         87646.753460   87998.444680   88564.822980   89739.711700   \
Status                                                                   
Developed             0.2            0.2           0.00            0.2   
Developing            0.0            0.0           0.04            0.0   

GDP         113751.850000  114293.843300  115761.577000  119172.741800  
Status                                                                  
Developed             0.2            0.2            0.2            0.2  
Developing            0.0            0.0            0.0            0.0  

[2 rows x 2902 columns]
Chi-square p-value: 0.3112029976471268

Chi-square test koristimo da proverim da li su nedostajuce vrednosti za GDP povezane sa Status. Rezultat (p=0.066) kaže da nemamo dovoljno jak dokaz da missingness zavisi od statusa, iako Developing ima malo veći procenat missing GDP.

In [87]:
y = "Life expectancy"
m = dataframe["GDP"].isna()

a = dataframe.loc[m, y].dropna()
b = dataframe.loc[~m, y].dropna()

print("p =", stats.mannwhitneyu(a, b, alternative="two-sided").pvalue)
p = 0.010427159959234789

Ovde koristim Mann–Whitney test da proverim da li se Life expectancy razlikuje između redova gde GDP nedostaje i gde postoji. Test poredi raspodelu vrednosti između ove dve grupe.

Dobijena p-vrednost je p = 0.010, što je manje od 0.05, pa možemo reći da postoji statistički značajna razlika između grupa. To znači da Life expectancy nije isti u redovima gde GDP nedostaje i gde je prisutan.

Zbog toga izgleda da nedostajanje GDP podataka nije potpuno slučajno, već je verovatno povezano sa karakteristikama država.

In [94]:
missing_gdp = dataframe["GDP"].isna()

numeric_cols = dataframe.select_dtypes(include=[np.number]).columns
numeric_cols = [col for col in numeric_cols if col != "GDP"]

rows = []

for col in numeric_cols:
    group_missing = dataframe.loc[missing_gdp, col].dropna()
    group_present = dataframe.loc[~missing_gdp, col].dropna()

    if len(group_missing) < 10 or len(group_present) < 10:
        continue

    p = stats.mannwhitneyu(group_missing, group_present, alternative="two-sided").pvalue

    rows.append({
        "feature": col,
        "pvalue": p,
        "mean_GDP_missing": group_missing.mean(),
        "mean_GDP_present": group_present.mean(),
        "n_missing": len(group_missing),
        "n_present": len(group_present),
    })

result = pd.DataFrame(rows).sort_values("pvalue")
result.head(20)
Out[94]:
feature pvalue mean_GDP_missing mean_GDP_present n_missing n_present
5 percentage expenditure 3.119805e-220 0.000000e+00 8.710772e+02 448 2490
2 Adult Mortality 6.881842e-06 1.799661e+02 1.620922e+02 443 2485
18 Schooling 1.356547e-05 1.122500e+01 1.208170e+01 288 2487
3 infant deaths 1.629540e-05 2.492188e+01 3.127229e+01 448 2490
17 Income composition of resources 4.227555e-05 5.952160e-01 6.312870e-01 287 2484
9 under-five deaths 6.711992e-05 3.416295e+01 4.345221e+01 448 2490
13 HIV/AIDS 8.172494e-03 9.439732e-01 1.885703e+00 448 2490
1 Life expectancy 1.042716e-02 6.840745e+01 6.937066e+01 443 2485
6 Hepatitis B 3.064502e-02 8.210243e+01 8.072642e+01 371 2014
15 thinness 10-19 years 1.784044e-01 4.880137e+00 4.832522e+00 438 2466
10 Polio 1.803821e-01 8.272500e+01 8.251916e+01 440 2479
11 Total expenditure 2.470868e-01 6.300974e+00 5.879074e+00 380 2332
14 Population 2.576610e-01 7.700579e+06 1.280247e+07 22 2264
7 Measles 2.904917e-01 2.771346e+03 2.356305e+03 448 2490
16 thinness 5-9 years 4.255603e-01 4.842694e+00 4.875223e+00 438 2466
8 BMI 6.791662e-01 3.772717e+01 3.842676e+01 438 2466
4 Alcohol 6.876185e-01 4.744636e+00 4.577813e+00 412 2332
12 Diphtheria 7.487622e-01 8.136818e+01 8.249375e+01 440 2479
0 Year 7.919964e-01 2.007571e+03 2.007509e+03 448 2490

Kada GDP nedostaje, vidi se da države u proseku imaju “slabiji” razvojni profil. Adult Mortality je veći, dok su Schooling, Income composition of resources i Life expectancy niži. Statistički testovi pokazuju da su ove razlike značajne.

Ovo ima smisla i u domenskom smislu: GDP per capita je snažno povezan sa nivoom razvoja države. Zemlje sa višim GDP obično imaju bolji zdravstveni sistem, duže školovanje i veću životnu očekivanu dužinu. Zato je logično da redovi bez GDP podataka često izgledaju kao države sa nižim nivoom razvoja.

Takođe se vidi da GDP često nedostaje zajedno sa Population, verovatno zato što oba podatka dolaze iz istih ekonomskih/statističkih izvora. Sa druge strane, Life expectancy je češće dostupna, jer dolazi iz zdravstvenih statistika (WHO).

Zbog toga Schooling, Income composition i Life expectancy mogu dobro da pomognu pri prediktivnoj imputaciji GDP-a, jer su realno povezani sa ekonomskim razvojem države.

In [92]:
pairs = [
    ("Schooling", "GDP"),
    ("Income composition of resources", "GDP"),
    ("Life expectancy", "GDP")
]

for x,y in pairs:
    tmp = dataframe[[x,y]].dropna()
    plt.figure(figsize=(5,4))
    plt.scatter(tmp[x], tmp[y], alpha=0.25, s=10)
    plt.title(f"{y} vs {x}")
    plt.xlabel(x)
    plt.ylabel(y)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Na scatter plotovima se vidi jasan trend: kako rastu Schooling, Income composition of resources i Life expectancy, u proseku raste i GDP. Najveće GDP vrednosti se uglavnom pojavljuju kod većih vrednosti ovih indikatora (npr. life expectancy oko 75–85, schooling oko 12–18 i income composition oko 0.7–0.9).

Takođe se vidi da je distribucija GDP-a veoma asimetrična, mnogo tačaka je pri nižim vrednostima, dok mali broj ide do veoma velikih vrednosti. Zbog toga grafik izgleda zbijeno u donjem delu, uz nekoliko ekstremnih outliera, ali se i dalje jasno vidi pozitivan odnos između ovih promenljivih i GDP-a.

Zbog ovoga ima smisla da GDP imputiram prediktivno, koristeći druge indikatore razvoja, umesto da ga popunjavam nasumično ili jednostavno medianom između država.

In [105]:
cols = ["GDP","Schooling","Income composition of resources","Life expectancy","Adult Mortality","Total expenditure","Alcohol"]
print(dataframe[cols].corr(numeric_only=True)["GDP"].sort_values(ascending=False))
GDP                                1.000000
Life expectancy                    0.457943
Income composition of resources    0.447561
Schooling                          0.440318
Alcohol                            0.356166
Total expenditure                  0.139819
Adult Mortality                   -0.300663
Name: GDP, dtype: float64
In [103]:
ct_gdp_status = pd.crosstab(dataframe['GDP'].isnull(), dataframe['Status'])
chi2_gdp, p_gdp_chi, dof_gdp, _ = chi2_contingency(ct_gdp_status)
print(f"  Chi2={chi2_gdp:.2f}, dof={dof_gdp}, p={p_gdp_chi:.4f}")
  Chi2=0.00, dof=0, p=1.0000

FAIL TO REJECT H0: GDP missing roughly equally across Status groups This suggests the gap is about WHICH country, not development status per se conflict/island nations are distributed across both categories

In [ ]:
gdp = dataframe["GDP"].copy()

country_med = dataframe.groupby("Country")["GDP"].transform("median")

collapsed = gdp.notna() & country_med.notna() & (gdp < 0.2 * country_med)

dataframe.loc[collapsed, "GDP"] = np.nan

Koristila sam medianu po državi kao referentnu vrednost jer je otpronija na outliere od srednje vrednosti. U ovom datasetu već postoje ekstremno pogrešne GDP vrednosti, pa bi mean bio “povučen” tim velikim ili veoma malim brojevima. Mediana bolje predstavlja tipičan nivo GDP-a za tu državu.

Zato sam kao heuristiku uzela da su vrednosti manje od 20% medijane verovatno greške, a ne realna ekonomska promena.

In [ ]:
df = dataframe.sort_values(["Country","Year"]).copy()

prev = df.groupby("Country")["GDP"].shift(1)
next_ = df.groupby("Country")["GDP"].shift(-1)

bad_prev = prev.notna() & ((df["GDP"] / prev < 0.2) | (df["GDP"] / prev > 5))
bad_next = next_.notna() & ((df["GDP"] / next_ < 0.2) | (df["GDP"] / next_ > 5))

bad_jump = df["GDP"].notna() & (bad_prev | bad_next)

dataframe.loc[bad_jump, "GDP"] = np.nan

Koristim dva pravila: (1) “bad_jump” hvata godine gde GDP naglo promeni u odnosu na susedne godine, što je tipično znak greške. (2) pravilo sa medianom hvata vrednosti koje su generalno preniske u odnosu na tipičan nivo te države, čak i ako susedne godine nisu dostupne. Nisu ista stvar, ali se dopunjuju.

In [308]:
dataframe = dataframe.sort_values(["Country","Year"])
dataframe["GDP"] = dataframe.groupby("Country")["GDP"].transform(
    lambda s: s.interpolate(limit_direction="both")
)

U državama gde postoje neke GDP vrednosti kroz godine, GDP per capita se obično menja postepeno, a ne naglo. Zbog toga ima smisla koristiti interpolaciju unutar iste države — ona popunjava nedostajuće godine prateći trend između postojećih vrednosti.

Mean ili median po državi bi dali istu vrednost za sve nedostajuće godine u toj državi. Time bi se izgubio vremenski trend, jer GDP per capita kroz godine obično raste ili opada postepeno. Na primer, ako država ima GDP 2000 → 2005 → 2010, mean bi ubacio istu vrednost između njih, što ne prati realno kretanje ekonomije.

In [95]:
work = dataframe.copy()

work["Status_encoded"] = work["Status"].map({"Developing": 0, "Developed": 1})

knn_features = [
    "GDP",
    "Life expectancy",
    "Schooling",
    "Income composition of resources",
    "Adult Mortality",
    "Total expenditure",
    "Alcohol",
    "Status_encoded"
]

work[knn_features] = work[knn_features].apply(pd.to_numeric)

knn_imputer = KNNImputer(n_neighbors=5, weights="distance")
work[knn_features] = knn_imputer.fit_transform(work[knn_features])

dataframe["GDP"] = work["GDP"]

GDP sam imputirala prediktivno (KNN), umesto mean/median ili interpolacije. Mean/median bi ignorisali razlike u razvoju između država, a interpolacija nije moguća za zemlje kojima nedostaju čitavi blokovi GDP podataka.

Pošto scatter grafici pokazuju jasnu vezu između GDP-a i razvojnih indikatora (schooling, income composition, life expectancy), GDP se može razumno proceniti na osnovu sličnih država sa sličnim vrednostima tih indikatora.

In [85]:
dataframe = dataframe.drop(columns=["GDP_diff"], errors="ignore")

HEPATITIS B

In [86]:
dataframe['Hepatitis B'] = pd.to_numeric(dataframe['Hepatitis B'], errors='coerce')
In [87]:
col = 'Hepatitis B'
m = dataframe[col].isna()

m.mean(), m.sum()
Out[87]:
(0.18493150684931506, 540)
In [88]:
hepb = dataframe['Hepatitis B'].dropna()

print(f"  Range: min={hepb.min()}, max={hepb.max()}")
print(f"  Values == 0: {(hepb == 0).sum()} rows")
print(f"  Values < 5: {(hepb < 5).sum()} rows")
  Range: min=1.0, max=99.0
  Values == 0: 0 rows
  Values < 5: 9 rows

ovde je sve ok, snaity chek prosao

In [89]:
dataframe.loc[~m, col].describe()
Out[89]:
count    2380.000000
mean       80.974790
std        25.053021
min         1.000000
25%        77.000000
50%        92.000000
75%        97.000000
max        99.000000
Name: Hepatitis B, dtype: float64
In [90]:
col = "Hepatitis B"
dataframe[col] = pd.to_numeric(dataframe[col], errors="coerce")

missing_rate = dataframe[col].isna().mean()*100
print("Missing %:", round(missing_rate,2))
print(dataframe[col].describe())
print("min/max:", dataframe[col].min(), dataframe[col].max())
Missing %: 18.49
count    2380.000000
mean       80.974790
std        25.053021
min         1.000000
25%        77.000000
50%        92.000000
75%        97.000000
max        99.000000
Name: Hepatitis B, dtype: float64
min/max: 1.0 99.0
In [91]:
tmp = dataframe[["Hepatitis B","Polio"]].dropna()
plt.figure(figsize=(5,4))
plt.scatter(tmp["Polio"], tmp["Hepatitis B"], alpha=0.25, s=10)
plt.xlabel("Polio %"); plt.ylabel("Hepatitis B %")
plt.title("HepB vs Polio (not random if diagonal)")
plt.show()
No description has been provided for this image
In [92]:
tmp = dataframe[["Hepatitis B","Diphtheria"]].dropna()
plt.figure(figsize=(5,4))
plt.scatter(tmp["Diphtheria"], tmp["Hepatitis B"], alpha=0.25, s=10)
plt.xlabel("Diphtheria %"); plt.ylabel("Hepatitis B %")
plt.title("HepB vs Diphtheria (not random if diagonal)")
plt.show()
No description has been provided for this image

Country with strong immunization system → high Polio → high Diphtheria → high HepB

In [93]:
miss_by_year = dataframe.groupby("Year")[col].apply(lambda s: s.isna().mean()).sort_index()

plt.figure()
plt.plot(miss_by_year.index, miss_by_year.values*100)
plt.title("Hepatitis B missingness by year (%)")
plt.xlabel("Year"); plt.ylabel("Missing (%)")
plt.show()
No description has been provided for this image

Imamo veci procenat nedostajucih vrednosti ranijih godina. Mozda je razlog weaker reporting/ incomplete coverage in the source

In [94]:
miss_by_status = dataframe.groupby("Status")[col].apply(lambda s: s.isna().mean())

plt.figure()
plt.bar(miss_by_status.index.astype(str), miss_by_status.values*100)
plt.title("Hepatitis B missingness by Status (%)")
plt.xlabel("Status"); plt.ylabel("Missing (%)")
plt.show()
No description has been provided for this image

Ima vise nedostajucih vrednosti za developed countries nego za developing tso je jako cudno

In [95]:
miss_by_country = dataframe.groupby("Country")[col].apply(lambda s: s.isna().mean()).sort_values(ascending=False)
top = miss_by_country.head(20)[::-1]

plt.figure(figsize=(8,6))
plt.barh(top.index.astype(str), top.values*100)
plt.title("Top 20 countries by Hepatitis B missingness (%)")
plt.xlabel("Missing (%)"); plt.ylabel("Country")
plt.show()
No description has been provided for this image

Neke drzave imaju 100% missing values i to su cak developed such as Finska, Danska, Slovenija, UK, Irska, Svajcarska, Norveska, Japan,.. Ovo uopste nisu poor/developing countries, ne mogu da izvedem zakljucak zasto nedostaju osim da je merge issue neki u pitanju? Za United Kingdom of Great Britain and Northernd Ireland i razumem o tila ova Guinea ili Central African Republic, ali za ostale ne? Mislim ne znam da li samo treba da ziuzmem ove drzave sto imaju po 100% missing values.. jer sigruno je neka greska. kad bih izuzela te drzave, za ostale drzave bih rekla da missing dues to poor sountry i kao nema podataka, ali kako da opravdam svedsku i holandiju i irsku? Dad cud a poevrim da l za ove drzwave i druge stvari pucajo po 100% Mozda ne nedostaju jer nema podataka, vec zato sto u tim drzavama nema hepB vacc programa jer kao read je u severnoj evropi i japanu, tamo ljudi paze sta rade..

In [96]:
import numpy as np
import pandas as pd

col = "Hepatitis B"

miss_rate = dataframe.groupby("Country")[col].apply(lambda s: s.isna().mean())
full_missing_countries = miss_rate[miss_rate == 1.0].index.tolist()

print("Countries with 100% missing HepB:", len(full_missing_countries))
print(full_missing_countries[:30])
Countries with 100% missing HepB: 9
['Denmark', 'Finland', 'Hungary', 'Iceland', 'Japan', 'Norway', 'Slovenia', 'Switzerland', 'United Kingdom of Great Britain and Northern Ireland']
In [97]:
weird = ["Denmark","Norway","Iceland","Finland","Switzerland","Japan"]
dataframe[dataframe["Country"].isin(weird)].isna().mean().sort_values(ascending=False).head(10)
Out[97]:
Hepatitis B                        1.000000
Total expenditure                  0.062500
Alcohol                            0.052083
Country                            0.000000
Schooling                          0.000000
Income composition of resources    0.000000
thinness 5-9 years                 0.000000
thinness 10-19 years               0.000000
Population                         0.000000
GDP                                0.000000
dtype: float64
In [98]:
dataframe[dataframe["Country"].isin(weird)].groupby("Country")["Year"].nunique().sort_values()
Out[98]:
Country
Denmark        16
Finland        16
Iceland        16
Japan          16
Norway         16
Switzerland    16
Name: Year, dtype: int64
In [99]:
plt.figure()
plt.hist(dataframe.loc[~m, col].dropna(), bins=20)
plt.title("Distribution of observed Hepatitis B")
plt.xlabel("Hepatitis B"); plt.ylabel("Count")
plt.show()
No description has been provided for this image

Bas je right skewed distribucija.. Nije normalna.. tako da necemo koristiti mean za imputaciju missing values, nego median

In [100]:
tab = pd.crosstab(dataframe["Status"], m)
print((tab.div(tab.sum(axis=1), axis=0)*100).round(2))
print(stats.chi2_contingency(tab)[:2]) 
Hepatitis B  False  True 
Status                   
Developed    66.21  33.79
Developing   84.76  15.24
(95.14342955568858, 1.7707895774911682e-22)

Status je kategorijska, a missingness je kategorijska takodje - binarna(yes/no), pa proveravamo da li je status faktor zbog missing values. H0 kaze da status ne utice na missingness, a alternativna gipoteza kaze da status utice na missingness. Odbjacujemo H0 jer nam je rezultat 2.74 approx, xnaci dolazimo do zakljucuka da missingness za hep b zavisi od statusa (znaci nije mcar). tj razlika je statisticki znacajna, stopa nedostajucih vrednosti jeste razlicita zimedju develpoed i devepoling.

In [101]:
a = dataframe.loc[m, "Life expectancy"].dropna()
b = dataframe.loc[~m, "Life expectancy"].dropna()

sw_stat_m, sw_p_m = shapiro(a.sample(min(500, len(a)), random_state=42))
sw_stat_p, sw_p_p = shapiro(b.sample(min(500, len(b)), random_state=42))

print(sw_stat_m, sw_p_m, sw_stat_p, sw_p_p)

print(stats.mannwhitneyu(a, b, alternative="two-sided"))
0.9363057321020695 8.735019563944517e-14 0.9474782707797058 2.541632488309031e-12
MannwhitneyuResult(statistic=565124.0, pvalue=1.943676481510776e-05)

ŠTA SMO DOBILI (na ovom dataset-u, uz uzorak do 500 redova po grupi): HepB missing (LE): W=0.934, p=4.48e-14 → NIJE normalno HepB present (LE): W=0.960, p=1.76e-10 → NIJE normalno

ZAKLJUČAK: Pošto su obe p-vrednosti < 0.05, ne pretpostavljamo normalnost, pa biramo neparametarski test: Mann–Whitney U (poređenje rangova).

WHY THIS TEST: If countries missing HepB also have lower life expectancy, that means the missing data is related to health outcomes → MNAR signal.

Ovde smo podelili dataset u 2 grupe: missing i observed. Testiramo da li se life expectancy razlikuje izmejdu grupa. Ovo je neparametarski test za pordjenje 2 nezavisna skupa podatak. I koristmo bas ovaj jer distruibucija nije nromalna da ne bi poredio mean izmedju njih. Dakle kao i za svaki statsticki test, imamo nultu hipotezu H0 kojakaze da je life expectanxy isti za a i b grupu, dok H1 tvrdi da postoji statisticki znacajna razlika u vrednosti rpomenljive life exp za ove dve grupe. Na osnovu rezultata testa, pval = 7.1, odbacujemo H0 i dolazimo do zakljucka da se life exp razlikuje izmedju ove dve grupe, znaci da life expectancy utice na missingness, i topet zakljcujemo missingness nije potpuno slucajan(nje mcar).

Poredimo redove gde hepatiti b nedostaje i tamo gde postoji vrednost. korisimo ovaj test posto nam distribucija hepatitis b promenljive nije nromalna.

TEST 3: Spearman Correlation (HepB vs Polio + Diphtheria) WHY: If HepB correlates strongly with other vaccine rates, we can use those as predictors when imputing HepB (important for imputation strategy). WHY Spearman not Pearson: HepB is left-skewed (non-normal per Shapiro above). Spearman uses rank order → robust to skew. Pearson assumes normality.

In [102]:
pol = "Polio"
diph = "Diphtheria"

mask_full_missing_rows = dataframe["Country"].isin(full_missing_countries) & dataframe[col].isna()

proxy = dataframe[[pol, diph]].mean(axis=1, skipna=True)

dataframe.loc[mask_full_missing_rows, col] = proxy.loc[mask_full_missing_rows]
In [103]:
dataframe[col] = dataframe[col].fillna(dataframe.groupby("Country")[col].transform("median"))

dataframe[col] = dataframe[col].fillna(dataframe.groupby("Status")[col].transform("median"))

dataframe[col] = dataframe[col].fillna(dataframe[col].median())

Hepatitis B ima značajan procenat nedostajućih vrednosti. Vizuelizacije pokazuju da missingness zavisi od godine i zemlje (sistematski obrazac), pa nije MCAR. Zbog toga smo uradili imputaciju hijerarhijski: prvo median po Country (očuva tipičan nivo zemlje kroz godine), zatim fallback median po Status, i na kraju global median. Dodatno smo dodali indikator Hepatitis B_missing da model može da iskoristi informaciju da je vrednost originalno nedostajala.

OTHER

In [104]:
cols = ["Total expenditure", "Alcohol", "Income composition of resources", "Schooling"]
ic = "Income composition of resources"

for c in cols:
    dataframe[c] = pd.to_numeric(dataframe[c], errors="coerce")
In [105]:
desc = dataframe[cols].describe(percentiles=[.01,.05,.25,.5,.75,.95,.99]).T
missing_pct = (dataframe[cols].isna().mean() * 100).round(3)

print(desc)
print("\nMissing %:")
print(missing_pct)
                                  count       mean       std   min      1%  \
Total expenditure                2710.0   5.938594  2.498713  0.37  1.2309   
Alcohol                          2727.0   4.631492  4.048715  0.01  0.0100   
Income composition of resources  2771.0   0.627551  0.210904  0.00  0.0000   
Schooling                        2775.0  11.992793  3.358920  0.00  2.0720   

                                    5%     25%     50%      75%     95%  \
Total expenditure                1.930   4.260   5.755   7.4975   9.760   
Alcohol                          0.010   0.935   3.790   7.7450  11.974   
Income composition of resources  0.277   0.493   0.677   0.7790   0.892   
Schooling                        5.800  10.100  12.300  14.3000  16.800   

                                     99%     max  
Total expenditure                12.9274  17.600  
Alcohol                          13.4848  17.870  
Income composition of resources   0.9233   0.948  
Schooling                        19.0000  20.700  

Missing %:
Total expenditure                  7.192
Alcohol                            6.610
Income composition of resources    5.103
Schooling                          4.966
dtype: float64
In [106]:
for c in cols:
    s = dataframe[c].dropna()

    plt.figure(figsize=(8,4))
    plt.hist(s, bins=40)
    plt.title(f"{c} distribution (hist)")
    plt.xlabel(c)
    plt.ylabel("Count")
    plt.show()

    plt.figure(figsize=(8,2.5))
    plt.boxplot(s, vert=False)
    plt.title(f"{c} (boxplot)")
    plt.xlabel(c)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [107]:
c = "Total expenditure"
s = pd.to_numeric(dataframe[c], errors="coerce").dropna()

print("mean:", s.mean())
print("median:", s.median())
print("std:", s.std())
print("skew:", s.skew())
print("kurtosis:", s.kurtosis())

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lo, hi = q1 - 1.5*iqr, q3 + 1.5*iqr
print("IQR outliers count:", int(((s < lo) | (s > hi)).sum()))
print("IQR outliers %:", ((s < lo) | (s > hi)).mean()*100)
mean: 5.938594095940959
median: 5.755
std: 2.4987133746041814
skew: 0.6186269030868405
kurtosis: 1.156003910679439
IQR outliers count: 32
IQR outliers %: 1.1808118081180812
In [108]:
cols = ["Total expenditure", "Alcohol", "Income composition of resources", "Schooling"]
ic = "Income composition of resources"

for c in cols:
    dataframe[c] = pd.to_numeric(dataframe[c], errors="coerce")

for c in cols:
    s = dataframe[c].dropna()
    print("\n==", c, "==")
    print("missing %:", round(dataframe[c].isna().mean()*100, 3))
    print("mean   :", round(s.mean(), 4))
    print("median :", round(s.median(), 4))
    print("skew   :", round(s.skew(), 4))
== Total expenditure ==
missing %: 7.192
mean   : 5.9386
median : 5.755
skew   : 0.6186

== Alcohol ==
missing %: 6.61
mean   : 4.6315
median : 3.79
skew   : 0.5827

== Income composition of resources ==
missing %: 5.103
mean   : 0.6276
median : 0.677
skew   : -1.1438

== Schooling ==
missing %: 4.966
mean   : 11.9928
median : 12.3
skew   : -0.6024
In [109]:
counts = dataframe.groupby("Year")["Alcohol"].apply(lambda s: (s == 0.01).sum()).sort_index()

plt.figure(figsize=(8,3))
plt.plot(counts.index, counts.values, marker="o")
plt.title("Count of Alcohol == 0.01 by Year (possible placeholder)")
plt.xlabel("Year")
plt.ylabel("Count")
plt.show()

print("Top spike years for Alcohol==0.01:")
print(counts.sort_values(ascending=False).head(10))
No description has been provided for this image
Top spike years for Alcohol==0.01:
Year
2014    86
2013    62
2012    57
2001     9
2000     7
2002     7
2003     7
2011     7
2004     6
2005     5
Name: Alcohol, dtype: int64
In [110]:
counts = dataframe.groupby("Year")[ic].apply(lambda s: (s == 0.0).sum()).sort_index()

plt.figure(figsize=(8,3))
plt.plot(counts.index, counts.values, marker="o")
plt.title("Count of Income composition == 0.0 by Year (possible placeholder)")
plt.xlabel("Year")
plt.ylabel("Count")
plt.show()

print("Top spike years for IncomeComp==0:")
print(counts.sort_values(ascending=False).head(10))
No description has been provided for this image
Top spike years for IncomeComp==0:
Year
2000    31
2001    17
2002    17
2003    17
2004    15
2005    13
2006     4
2007     4
2008     4
2009     4
Name: Income composition of resources, dtype: int64
In [111]:
counts = dataframe.groupby("Year")["Schooling"].apply(lambda s: (s == 0.0).sum()).sort_index()

plt.figure(figsize=(8,3))
plt.plot(counts.index, counts.values, marker="o")
plt.title("Count of Schooling == 0.0 by Year (possible placeholder)")
plt.xlabel("Year")
plt.ylabel("Count")
plt.show()

print("Top spike years for Schooling==0:")
print(counts.sort_values(ascending=False).head(10))
No description has been provided for this image
Top spike years for Schooling==0:
Year
2000    8
2001    3
2002    3
2003    3
2004    2
2005    2
2013    2
2006    1
2007    1
2008    1
Name: Schooling, dtype: int64
In [112]:
df2 = dataframe.copy()

df2.loc[df2["Alcohol"] == 0.01, "Alcohol"] = np.nan

df2.loc[df2[ic] == 0.0, ic] = np.nan

df2.loc[df2["Schooling"] == 0.0, "Schooling"] = np.nan

print("Missing % AFTER placeholder->NaN:")
print((df2[cols].isna().mean()*100).round(3))
Missing % AFTER placeholder->NaN:
Total expenditure                   7.192
Alcohol                            15.890
Income composition of resources     9.555
Schooling                           5.925
dtype: float64
In [113]:
df2 = df2.sort_values(["Country", "Year"])

for c in cols:
    df2[c] = df2.groupby("Country")[c].transform(
        lambda s: s.interpolate(limit_direction="both")
    )

    df2[c] = df2.groupby("Country")[c].transform(lambda s: s.fillna(s.median()))

    df2[c] = df2[c].fillna(df2[c].median())

print("Missing % AFTER imputation:")
print((df2[cols].isna().mean()*100).round(3))
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
Missing % AFTER imputation:
Total expenditure                  0.0
Alcohol                            0.0
Income composition of resources    0.0
Schooling                          0.0
dtype: float64
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
In [114]:
dataframe = df2
In [115]:
for c in cols:
    s = df2[c].dropna()

    plt.figure(figsize=(8,4))
    plt.hist(s, bins=40)
    plt.title(f"{c} distribution AFTER imputation")
    plt.xlabel(c)
    plt.ylabel("Count")
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [116]:
def show_extremes(col, n=15):
    s = pd.to_numeric(dataframe[col], errors="coerce")
    print(f"\n=== {col} TOP {n} ===")
    print(dataframe.loc[s.nlargest(n).index, ["Country","Year",col]].sort_values(col, ascending=False).to_string(index=False))
    print(f"\n=== {col} BOTTOM {n} ===")
    print(dataframe.loc[s.nsmallest(n).index, ["Country","Year",col]].sort_values(col).to_string(index=False))

show_extremes("Alcohol")
show_extremes("Schooling")
show_extremes("Total expenditure")
show_extremes("Income composition of resources")
=== Alcohol TOP 15 ===
  Country  Year  Alcohol
  Estonia  2007    17.87
  Belarus  2011    17.31
  Estonia  2008    16.99
  Estonia  2006    16.58
  Belarus  2012    16.35
  Estonia  2005    15.52
Lithuania  2014    15.19
Lithuania  2015    15.19
Lithuania  2012    15.14
  Estonia  2004    15.07
  Estonia  2009    15.04
Lithuania  2013    15.04
  Estonia  2010    14.97
  Estonia  2011    14.97
  Estonia  2012    14.97

=== Alcohol BOTTOM 15 ===
                   Country  Year  Alcohol
               Afghanistan  2000     0.02
               Afghanistan  2001     0.02
               Afghanistan  2002     0.02
               Afghanistan  2003     0.02
               Afghanistan  2004     0.02
               Afghanistan  2005     0.02
               Afghanistan  2007     0.02
Iran (Islamic Republic of)  2000     0.02
Iran (Islamic Republic of)  2001     0.02
Iran (Islamic Republic of)  2002     0.02
Iran (Islamic Republic of)  2003     0.02
Iran (Islamic Republic of)  2004     0.02
Iran (Islamic Republic of)  2005     0.02
Iran (Islamic Republic of)  2006     0.02
Iran (Islamic Republic of)  2007     0.02

=== Schooling TOP 15 ===
    Country  Year  Schooling
  Australia  2004       20.7
  Australia  2003       20.6
  Australia  2001       20.5
  Australia  2000       20.4
  Australia  2014       20.4
  Australia  2015       20.4
  Australia  2005       20.3
  Australia  2006       20.3
  Australia  2013       20.3
New Zealand  2010       20.3
  Australia  2002       20.1
  Australia  2012       20.1
  Australia  2011       19.8
New Zealand  2011       19.7
  Australia  2010       19.5

=== Schooling BOTTOM 15 ===
     Country  Year  Schooling
       Niger  2000        2.8
    Djibouti  2000        2.9
    Djibouti  2001        2.9
       Niger  2001        2.9
       Niger  2002        2.9
       Niger  2003        3.0
       Niger  2004        3.1
    Djibouti  2002        3.3
Burkina Faso  2000        3.4
Burkina Faso  2001        3.5
    Djibouti  2003        3.5
       Niger  2005        3.5
Burkina Faso  2002        3.6
    Djibouti  2004        3.7
       Niger  2006        3.7

=== Total expenditure TOP 15 ===
                 Country  Year  Total expenditure
United States of America  2011              17.60
        Marshall Islands  2013              17.24
United States of America  2010              17.20
United States of America  2012              17.20
United States of America  2014              17.14
United States of America  2015              17.14
United States of America  2009              17.00
United States of America  2013              16.90
                  Tuvalu  2013              16.61
United States of America  2008              16.20
United States of America  2003              15.60
United States of America  2007              15.57
United States of America  2006              15.27
United States of America  2005              15.15
United States of America  2004              15.14

=== Total expenditure BOTTOM 15 ===
     Country  Year  Total expenditure
 Timor-Leste  2007               0.37
 Timor-Leste  2006               0.65
 Timor-Leste  2008               0.74
 Timor-Leste  2011               0.76
 Timor-Leste  2010               0.92
     Germany  2000               1.10
 Timor-Leste  2012               1.10
     Austria  2001               1.12
      Serbia  2013               1.12
Sierra Leone  2007               1.12
     Germany  2001               1.15
    Kiribati  2013               1.15
     Belgium  2010               1.17
       Japan  2012               1.17
     Denmark  2008               1.18

=== Income composition of resources TOP 15 ===
    Country  Year  Income composition of resources
     Norway  2015                            0.948
     Norway  2014                            0.945
     Norway  2013                            0.942
     Norway  2012                            0.941
     Norway  2011                            0.939
Switzerland  2015                            0.938
  Australia  2015                            0.937
  Australia  2014                            0.936
     Norway  2008                            0.936
     Norway  2009                            0.936
     Norway  2010                            0.936
Switzerland  2014                            0.936
     Norway  2007                            0.934
Switzerland  2013                            0.934
  Australia  2013                            0.933

=== Income composition of resources BOTTOM 15 ===
 Country  Year  Income composition of resources
   Niger  2000                            0.253
   Niger  2001                            0.255
   Niger  2002                            0.261
   Niger  2003                            0.266
 Burundi  2000                            0.268
 Burundi  2001                            0.268
 Burundi  2002                            0.268
   Niger  2004                            0.270
 Burundi  2003                            0.276
   Niger  2005                            0.278
 Burundi  2004                            0.279
Ethiopia  2000                            0.283
Ethiopia  2001                            0.283
    Chad  2003                            0.284
 Burundi  2005                            0.286
In [117]:
df.to_csv('output.csv')
In [118]:
cols = ["BMI","thinness 10-19 years","thinness 5-9 years",
        "Diphtheria","Polio","Adult Mortality","Life expectancy"]

for c in cols:
    dataframe[c] = pd.to_numeric(dataframe[c], errors="coerce")
In [119]:
dataframe[["Country","Year","BMI"]].sort_values("BMI", ascending=False).head(15)
Out[119]:
Country Year BMI
1812 Nauru 2013 87.3
1958 Palau 2013 83.3
1650 Marshall Islands 2013 81.6
2713 Tuvalu 2013 79.3
1378 Kiribati 2015 77.6
1379 Kiribati 2014 77.1
1380 Kiribati 2013 76.7
1381 Kiribati 2012 76.2
1382 Kiribati 2011 75.7
2633 Tonga 2015 75.2
1383 Kiribati 2010 75.2
2634 Tonga 2014 74.8
2200 Samoa 2015 74.7
1384 Kiribati 2009 74.6
2201 Samoa 2014 74.3
In [120]:
checks = {
    "BMI==0": (dataframe["BMI"]==0).sum(),
    "BMI==1": (dataframe["BMI"]==1).sum(),
    "Thin10==0": (dataframe["thinness 10-19 years"]==0).sum(),
    "Thin5==0": (dataframe["thinness 5-9 years"]==0).sum(),
    "Diphtheria==0": (dataframe["Diphtheria"]==0).sum(),
    "Polio==0": (dataframe["Polio"]==0).sum()
}
print(checks)
{'BMI==0': 0, 'BMI==1': 1, 'Thin10==0': 0, 'Thin5==0': 0, 'Diphtheria==0': 0, 'Polio==0': 0}
In [121]:
print(dataframe.groupby("Year")["BMI"].apply(lambda s: (s==1).sum()).sort_values(ascending=False).head(10))
print(dataframe.groupby("Year")["Diphtheria"].apply(lambda s: (s==0).sum()).sort_values(ascending=False).head(10))
print(dataframe.groupby("Year")["Polio"].apply(lambda s: (s==0).sum()).sort_values(ascending=False).head(10))
Year
2002    1
2000    0
2001    0
2003    0
2004    0
2005    0
2006    0
2007    0
2008    0
2009    0
Name: BMI, dtype: int64
Year
2000    0
2001    0
2002    0
2003    0
2004    0
2005    0
2006    0
2007    0
2008    0
2009    0
Name: Diphtheria, dtype: int64
Year
2000    0
2001    0
2002    0
2003    0
2004    0
2005    0
2006    0
2007    0
2008    0
2009    0
Name: Polio, dtype: int64
In [122]:
for c in ["BMI","thinness 10-19 years","thinness 5-9 years","Diphtheria","Polio"]:
    tmp = dataframe[["Country","Year",c]].dropna().sort_values(["Country","Year"])
    tmp["absdiff"] = tmp.groupby("Country")[c].diff().abs()
    thresh = tmp["absdiff"].quantile(0.99)
    big = tmp[tmp["absdiff"] > thresh].sort_values("absdiff", ascending=False).head(10)
    print("\n==", c, "== 99th% absdiff threshold:", round(thresh,4))
    print(big[["Country","Year",c,"absdiff"]].to_string(index=False))
== BMI == 99th% absdiff threshold: 54.6
             Country  Year  BMI  absdiff
            Kiribati  2004 71.4     63.8
               Tonga  2008 71.5     63.7
              Kuwait  2015 71.4     63.6
               Samoa  2008 71.4     63.5
               Samoa  2006  7.3     62.4
               Tonga  2006  7.1     62.3
              Kuwait  2013  7.2     62.3
            Kiribati  2003  7.6     62.1
United Arab Emirates  2014 62.4     55.9
             Tunisia  2015 61.2     55.0

== thinness 10-19 years == 99th% absdiff threshold: 8.8
     Country  Year  thinness 10-19 years  absdiff
    Pakistan  2007                   2.8     18.2
    Pakistan  2012                  19.8     17.8
 Afghanistan  2002                  19.9     17.8
  Bangladesh  2005                  19.9     17.8
South Africa  2006                   1.6     10.0
     Namibia  2009                   1.9      9.6
    Botswana  2003                   1.9      9.5
     Lesotho  2002                   1.6      9.5
    Zimbabwe  2001                   1.6      9.4
       Niger  2010                   1.7      9.3

== thinness 5-9 years == 99th% absdiff threshold: 8.8
     Country  Year  thinness 5-9 years  absdiff
  Bangladesh  2003                 2.9     18.2
    Pakistan  2009                 2.9     18.2
    Pakistan  2014                19.8     17.8
  Bangladesh  2008                19.9     17.8
 Afghanistan  2003                19.9     17.7
South Africa  2008                 1.7     10.0
     Namibia  2009                 1.9      9.5
     Lesotho  2002                 1.6      9.5
    Zimbabwe  2001                 1.7      9.5
    Botswana  2003                 1.8      9.5

== Diphtheria == 99th% absdiff threshold: 83.0
        Country  Year  Diphtheria  absdiff
        Belarus  2003         5.0     94.0
        Belarus  2004        99.0     94.0
    Saint Lucia  2001        99.0     92.0
     Cabo Verde  2011         9.0     90.0
Solomon Islands  2011        99.0     90.0
Solomon Islands  2007         9.0     90.0
          Ghana  2014        98.0     89.0
        Ukraine  2008         9.0     89.0
      Swaziland  2015         9.0     89.0
           Peru  2001         9.0     89.0

== Polio == 99th% absdiff threshold: 84.0
        Country  Year  Polio  absdiff
    Saint Lucia  2001   99.0     92.0
        Comoros  2002   98.0     91.0
Solomon Islands  2006   99.0     90.0
     Cabo Verde  2011    9.0     90.0
    Saint Lucia  2002    9.0     90.0
        Comoros  2003    8.0     90.0
        Belarus  2008   98.0     89.0
        Belarus  2007    9.0     88.0
          Kenya  2011   97.0     88.0
        Ecuador  2004    9.0     88.0
In [123]:
spike_countries = ["Kiribati", "Tonga", "Kuwait", "Samoa", "United Arab Emirates", "Tunisia"]

tmp = dataframe.loc[dataframe["Country"].isin(spike_countries), ["Country","Year","BMI"]].copy()
tmp["BMI"] = pd.to_numeric(tmp["BMI"], errors="coerce")

for country in spike_countries:
    s = tmp[tmp["Country"] == country].sort_values("Year")
    plt.figure(figsize=(8,3))
    plt.plot(s["Year"], s["BMI"], marker="o")
    plt.title(f"BMI over time: {country}")
    plt.xlabel("Year")
    plt.ylabel("BMI")
    plt.grid(True, alpha=0.3)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [124]:
bmi = pd.to_numeric(dataframe["BMI"], errors="coerce")

print("BMI < 10:", int((bmi < 10).sum()))
print("BMI > 60:", int((bmi > 60).sum()))
print("Top high BMI rows:")
print(dataframe.loc[bmi > 60, ["Country","Year","BMI"]].sort_values("BMI", ascending=False).head(30).to_string(index=False))
print("\nTop low BMI rows:")
print(dataframe.loc[bmi < 10, ["Country","Year","BMI"]].sort_values("BMI").head(30).to_string(index=False))

plt.figure(figsize=(8,4))
plt.hist(bmi.dropna(), bins=60)
plt.title("BMI distribution (watch for weird mass <10 or >60)")
plt.xlabel("BMI")
plt.ylabel("Count")
plt.show()
BMI < 10: 281
BMI > 60: 351
Top high BMI rows:
         Country  Year  BMI
           Nauru  2013 87.3
           Palau  2013 83.3
Marshall Islands  2013 81.6
          Tuvalu  2013 79.3
        Kiribati  2015 77.6
        Kiribati  2014 77.1
        Kiribati  2013 76.7
        Kiribati  2012 76.2
        Kiribati  2011 75.7
           Tonga  2015 75.2
        Kiribati  2010 75.2
           Tonga  2014 74.8
           Samoa  2015 74.7
        Kiribati  2009 74.6
           Tonga  2013 74.3
           Samoa  2014 74.3
        Kiribati  2008 74.1
           Samoa  2013 73.8
           Tonga  2012 73.8
        Kiribati  2007 73.4
           Samoa  2012 73.4
           Tonga  2011 73.3
           Samoa  2011 72.9
        Kiribati  2006 72.8
           Tonga  2010 72.7
           Samoa  2010 72.5
           Tonga  2009 72.1
        Kiribati  2005 72.1
           Samoa  2009 72.0
           Tonga  2008 71.5

Top low BMI rows:
                         Country  Year  BMI
                        Viet Nam  2002  1.0
                        Viet Nam  2003  1.4
                      Bangladesh  2000  1.4
                      Bangladesh  2001  1.8
                        Viet Nam  2004  1.9
                      Madagascar  2014  2.0
                          Rwanda  2013  2.1
                     Philippines  2005  2.1
                         Comoros  2007  2.1
                      Mozambique  2009  2.1
Democratic Republic of the Congo  2012  2.1
                           Benin  2004  2.1
                        Pakistan  2007  2.1
Lao People's Democratic Republic  2013  2.1
                           Kenya  2012  2.1
                   Guinea-Bissau  2005  2.1
                           Ghana  2001  2.1
                        Thailand  2002  2.2
                         Liberia  2000  2.2
                            Mali  2009  2.2
     United Republic of Tanzania  2009  2.2
                    Sierra Leone  2007  2.2
                          Zambia  2009  2.2
        Central African Republic  2010  2.2
               Equatorial Guinea  2005  2.2
                        Maldives  2008  2.3
                          Gambia  2004  2.3
                           Congo  2002  2.3
                          Bhutan  2010  2.3
                          Guinea  2009  2.3
No description has been provided for this image
In [125]:
df2 = dataframe.copy()
df2["BMI"] = pd.to_numeric(df2["BMI"], errors="coerce")

tmp = df2[["Country","Year","BMI"]].dropna().sort_values(["Country","Year"])
tmp["absdiff"] = tmp.groupby("Country")["BMI"].diff().abs()

# show the worst jumps
worst = tmp.sort_values("absdiff", ascending=False).head(50)
print(worst[["Country","Year","BMI","absdiff"]].to_string(index=False))

# list countries that have a huge jump (tune threshold if you want)
spike_countries = tmp.loc[tmp["absdiff"] > 30, "Country"].unique()
print("\nCountries with BMI jump > 30:", len(spike_countries))
print(list(spike_countries)[:50])
                                             Country  Year  BMI  absdiff
                                            Kiribati  2004 71.4     63.8
                                               Tonga  2008 71.5     63.7
                                              Kuwait  2015 71.4     63.6
                                               Samoa  2008 71.4     63.5
                                               Samoa  2006  7.3     62.4
                                               Tonga  2006  7.1     62.3
                                              Kuwait  2013  7.2     62.3
                                            Kiribati  2003  7.6     62.1
                                United Arab Emirates  2014 62.4     55.9
                                             Tunisia  2015 61.2     55.0
                                                Fiji  2013 61.1     54.9
                                               Libya  2012 61.8     54.9
                                              Turkey  2009 61.1     54.9
                                               Egypt  2015 61.1     54.9
                                              Jordan  2010 61.7     54.8
                            United States of America  2002 61.7     54.8
                                             Ireland  2013 61.3     54.8
                                        Saudi Arabia  2007 61.6     54.7
                                              Mexico  2012 61.5     54.7
                                              Poland  2014 61.1     54.7
                                            Portugal  2015 61.6     54.7
                                              Greece  2006 61.2     54.7
                                             Bahrain  2012 61.5     54.7
                                             Croatia  2011 61.3     54.7
                                              Canada  2005 61.3     54.7
                                                Cuba  2015 61.4     54.7
                                             Belarus  2013 61.1     54.6
                                               Spain  2006 61.1     54.6
                                               Chile  2011 61.2     54.6
                                         New Zealand  2004 61.5     54.6
                                             Lebanon  2006 61.4     54.6
                  Venezuela (Bolivarian Republic of)  2013 61.0     54.6
                                           Argentina  2012 61.0     54.6
                                            Bulgaria  2008 61.5     54.6
                                             Hungary  2009 61.1     54.6
                                           Australia  2005 61.5     54.6
United Kingdom of Great Britain and Northern Ireland  2006 61.3     54.6
                                             Ukraine  2015 61.3     54.6
                                          Montenegro  2014 61.3     54.6
                                             Bahamas  2010 61.3     54.6
                                             Germany  2013 61.4     54.5
                                         Netherlands  2013 61.0     54.5
                                              France  2012 61.1     54.5
                                              Israel  2006 61.1     54.5
                                               Italy  2010 61.0     54.5
                                             Uruguay  2010 61.2     54.5
                                           Lithuania  2013 61.4     54.5
                                              Latvia  2015 61.2     54.5
                                             Czechia  2005 61.3     54.5
                                              Norway  2015 61.2     54.4

Countries with BMI jump > 30: 108
['Albania', 'Algeria', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina', 'Brazil', 'Brunei Darussalam', 'Bulgaria', 'Canada', 'Chile', 'Colombia', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czechia', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Fiji', 'Finland', 'France', 'Georgia', 'Germany', 'Greece', 'Grenada', 'Guatemala', 'Guyana', 'Haiti', 'Honduras', 'Hungary', 'Iceland', 'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica']
In [126]:
dataframe = dataframe.drop(columns=["BMI"], errors="ignore")
In [127]:
dataframe = dataframe.drop(columns=["thinness 5-9 years"], errors="ignore")
In [128]:
for c in ["Polio", "Diphtheria"]:
    dataframe[c] = pd.to_numeric(dataframe[c], errors="coerce")
    dataframe[c] = dataframe[c].clip(0, 100)
In [129]:
cols = ["Polio", "Diphtheria"]
for c in cols:
    dataframe[c] = pd.to_numeric(dataframe[c], errors="coerce")

print("Missing %:")
print((dataframe[cols].isna().mean()*100).round(3))

for c in cols:
    plt.figure(figsize=(8,4))
    plt.hist(dataframe[c].dropna(), bins=40)
    plt.title(f"{c} distribution (before imputation)")
    plt.xlabel(c)
    plt.ylabel("Count")
    plt.show()
Missing %:
Polio         0.651
Diphtheria    0.651
dtype: float64
No description has been provided for this image
No description has been provided for this image
In [130]:
for c in ["Polio","Diphtheria"]:
    s = pd.to_numeric(dataframe[c], errors="coerce")
    print("\n==", c, "==")
    print("min:", float(s.min()), "max:", float(s.max()))
    print("<10 count:", int((s < 10).sum()))
    print("<10 %:", round((s < 10).mean()*100, 3))
    print("==0 count:", int((s == 0).sum()))
== Polio ==
min: 3.0 max: 99.0
<10 count: 167
<10 %: 5.719
==0 count: 0

== Diphtheria ==
min: 2.0 max: 99.0
<10 count: 166
<10 %: 5.685
==0 count: 0
In [131]:
low = dataframe.loc[pd.to_numeric(dataframe["Polio"], errors="coerce") < 10,
                    ["Country","Year","Polio","Diphtheria","Status"]].copy()
low["Polio"] = pd.to_numeric(low["Polio"], errors="coerce")
print("Low Polio rows:", len(low))
print(low.sort_values(["Polio","Country","Year"]).head(40).to_string(index=False))
Low Polio rows: 167
                         Country  Year  Polio  Diphtheria     Status
                          Angola  2000    3.0        28.0 Developing
                            Chad  2000    3.0        36.0 Developing
                            Chad  2008    3.0        19.0 Developing
Democratic Republic of the Congo  2001    3.0         3.0 Developing
               Equatorial Guinea  2012    3.0        24.0 Developing
               Equatorial Guinea  2013    3.0         3.0 Developing
                          Angola  2003    4.0         4.0 Developing
                          Angola  2004    4.0         4.0 Developing
        Central African Republic  2001    4.0         4.0 Developing
                            Chad  2011    4.0        33.0 Developing
Democratic Republic of the Congo  2002    4.0        38.0 Developing
                           Niger  2011    4.0        75.0 Developing
                         Nigeria  2002    4.0        25.0 Developing
                     Afghanistan  2004    5.0         5.0 Developing
                           Congo  2003    5.0         5.0 Developing
               Equatorial Guinea  2005    5.0        39.0 Developing
                           Haiti  2000    5.0        41.0 Developing
Lao People's Democratic Republic  2005    5.0        49.0 Developing
                     South Sudan  2013    5.0        45.0 Developing
            Syrian Arab Republic  2013    5.0        41.0 Developing
            Syrian Arab Republic  2015    5.0        41.0 Developing
                     Afghanistan  2015    6.0        65.0 Developing
Democratic Republic of the Congo  2005    6.0         6.0 Developing
                          Guinea  2009    6.0        57.0 Developing
                           Haiti  2005    6.0         6.0 Developing
Lao People's Democratic Republic  2008    6.0        61.0 Developing
                      Madagascar  2001    6.0         6.0 Developing
                         Nigeria  2008    6.0        53.0 Developing
                           Samoa  2011    6.0        65.0 Developing
                         Senegal  2002    6.0         6.0 Developing
                           Sudan  2002    6.0         6.0 Developing
            Syrian Arab Republic  2011    6.0        72.0 Developing
                          Angola  2015    7.0        64.0 Developing
                         Comoros  2000    7.0         7.0 Developing
                         Comoros  2001    7.0         7.0 Developing
                   Côte d'Ivoire  2001    7.0        66.0 Developing
                   Côte d'Ivoire  2002    7.0        64.0 Developing
                        Ethiopia  2011    7.0        65.0 Developing
                        Ethiopia  2012    7.0        69.0 Developing
                        Ethiopia  2013    7.0        72.0 Developing
In [132]:
low = dataframe.loc[pd.to_numeric(dataframe["Diphtheria"], errors="coerce") < 10,
                    ["Country","Year","Polio","Diphtheria","Status"]].copy()
low["Diphtheria"] = pd.to_numeric(low["Diphtheria"], errors="coerce")
print("Low Diphtheria rows:", len(low))
print(low.sort_values(["Diphtheria","Country","Year"]).head(40).to_string(index=False))
Low Diphtheria rows: 166
                           Country  Year  Polio  Diphtheria     Status
                 Equatorial Guinea  2014   24.0         2.0 Developing
  Democratic Republic of the Congo  2001    3.0         3.0 Developing
                 Equatorial Guinea  2013    3.0         3.0 Developing
                          Ethiopia  2000   55.0         3.0 Developing
                            Angola  2003    4.0         4.0 Developing
                            Angola  2004    4.0         4.0 Developing
          Central African Republic  2001    4.0         4.0 Developing
                              Chad  2006   49.0         4.0 Developing
                              Chad  2012   51.0         4.0 Developing
  Democratic Republic of the Congo  2000   42.0         4.0 Developing
                 Equatorial Guinea  2006   52.0         4.0 Developing
                          Ethiopia  2004   54.0         4.0 Developing
                           Nigeria  2006   46.0         4.0 Developing
                       Afghanistan  2004    5.0         5.0 Developing
                           Belarus  2003   53.0         5.0 Developing
                             Congo  2003    5.0         5.0 Developing
                          Ethiopia  2007   61.0         5.0 Developing
                            Guinea  2001   52.0         5.0 Developing
  Lao People's Democratic Republic  2007   46.0         5.0 Developing
                           Liberia  2014   49.0         5.0 Developing
                              Togo  2001   51.0         5.0 Developing
                           Ukraine  2011   54.0         5.0 Developing
Venezuela (Bolivarian Republic of)  2008   76.0         5.0 Developing
                            Angola  2009   63.0         6.0 Developing
                          Cambodia  2001   59.0         6.0 Developing
  Democratic Republic of the Congo  2005    6.0         6.0 Developing
  Democratic Republic of the Congo  2010   76.0         6.0 Developing
                            Guinea  2004   65.0         6.0 Developing
                            Guinea  2008   59.0         6.0 Developing
                     Guinea-Bissau  2003   65.0         6.0 Developing
                             Haiti  2005    6.0         6.0 Developing
                             Haiti  2006   61.0         6.0 Developing
                             Haiti  2015   56.0         6.0 Developing
                           Liberia  2005   66.0         6.0 Developing
                           Liberia  2006   66.0         6.0 Developing
                        Madagascar  2001    6.0         6.0 Developing
                       Philippines  2015   79.0         6.0 Developing
                           Senegal  2002    6.0         6.0 Developing
                             Sudan  2002    6.0         6.0 Developing
                             Benin  2005   73.0         7.0 Developing
In [133]:
low = dataframe.loc[pd.to_numeric(dataframe["Polio"], errors="coerce") < 10,
                    ["Country","Year","Polio","Diphtheria","Status"]].copy()
low["Polio"] = pd.to_numeric(low["Polio"], errors="coerce")

print("Max Polio in low set:", low["Polio"].max())
print(low.sort_values("Polio").head(20).to_string(index=False))
Max Polio in low set: 9.0
                         Country  Year  Polio  Diphtheria     Status
Democratic Republic of the Congo  2001    3.0         3.0 Developing
                          Angola  2000    3.0        28.0 Developing
                            Chad  2000    3.0        36.0 Developing
               Equatorial Guinea  2012    3.0        24.0 Developing
               Equatorial Guinea  2013    3.0         3.0 Developing
                            Chad  2008    3.0        19.0 Developing
Democratic Republic of the Congo  2002    4.0        38.0 Developing
        Central African Republic  2001    4.0         4.0 Developing
                            Chad  2011    4.0        33.0 Developing
                         Nigeria  2002    4.0        25.0 Developing
                           Niger  2011    4.0        75.0 Developing
                          Angola  2004    4.0         4.0 Developing
                          Angola  2003    4.0         4.0 Developing
                           Congo  2003    5.0         5.0 Developing
               Equatorial Guinea  2005    5.0        39.0 Developing
Lao People's Democratic Republic  2005    5.0        49.0 Developing
                           Haiti  2000    5.0        41.0 Developing
                     Afghanistan  2004    5.0         5.0 Developing
            Syrian Arab Republic  2013    5.0        41.0 Developing
                     South Sudan  2013    5.0        45.0 Developing
In [134]:
df2 = dataframe.sort_values(["Country", "Year"]).copy()

for c in ["Polio", "Diphtheria"]:
    df2[c] = df2.groupby("Country")[c].transform(lambda s: s.interpolate(limit_direction="both"))
    df2[c] = df2.groupby("Country")[c].transform(lambda s: s.fillna(s.median()))
    df2[c] = df2[c].fillna(df2[c].median())

print("Missing % after Polio/Diphtheria imputation:")
print((df2[["Polio","Diphtheria"]].isna().mean()*100).round(4))

dataframe = df2
Missing % after Polio/Diphtheria imputation:
Polio         0.0
Diphtheria    0.0
dtype: float64
In [135]:
tmp = dataframe[["Life expectancy","Polio","Diphtheria"]].copy()
for c in tmp.columns:
    tmp[c] = pd.to_numeric(tmp[c], errors="coerce")

print(tmp.corr(numeric_only=True)["Life expectancy"].sort_values(ascending=False))
Life expectancy    1.000000
Diphtheria         0.464856
Polio              0.449946
Name: Life expectancy, dtype: float64
In [136]:
col = "thinness 10-19 years"
dataframe[col] = pd.to_numeric(dataframe[col], errors="coerce")

print("Missing %:", dataframe[col].isna().mean()*100)
print(dataframe[col].describe())

plt.figure(figsize=(8,4))
plt.hist(dataframe[col].dropna(), bins=40)
plt.title("thinness 10-19 years distribution")
plt.xlabel(col)
plt.ylabel("Count")
plt.show()
Missing %: 1.1643835616438356
count    2886.000000
mean        4.829522
std         4.428383
min         0.100000
25%         1.600000
50%         3.300000
75%         7.175000
max        27.700000
Name: thinness 10-19 years, dtype: float64
No description has been provided for this image
In [137]:
df_sorted = dataframe.sort_values(["Country","Year"]).copy()
df_sorted[col] = pd.to_numeric(df_sorted[col], errors="coerce")

tmp = df_sorted[["Country","Year",col]].dropna().copy()
tmp["absdiff"] = tmp.groupby("Country")[col].diff().abs()

thr = tmp["absdiff"].quantile(0.99)
spikes = tmp[tmp["absdiff"] > thr].sort_values("absdiff", ascending=False)

print("99th% jump threshold:", thr)
print("Top spikes:")
print(spikes.head(25).to_string(index=False))
99th% jump threshold: 8.8
Top spikes:
                         Country  Year  thinness 10-19 years  absdiff
                        Pakistan  2007                   2.8     18.2
                        Pakistan  2012                  19.8     17.8
                     Afghanistan  2002                  19.9     17.8
                      Bangladesh  2005                  19.9     17.8
                    South Africa  2006                   1.6     10.0
                         Namibia  2009                   1.9      9.6
                        Botswana  2003                   1.9      9.5
                         Lesotho  2002                   1.6      9.5
                        Zimbabwe  2001                   1.6      9.4
                           Niger  2010                   1.7      9.3
                         Nigeria  2012                   1.7      9.3
                    Burkina Faso  2003                   1.7      9.3
Democratic Republic of the Congo  2008                   1.8      9.3
                            Mali  2001                   1.8      9.2
                            Chad  2003                   1.9      9.2
                         Senegal  2008                   1.8      9.2
                     Timor-Leste  2014                   1.9      9.2
                        Ethiopia  2011                   1.9      9.1
                       Indonesia  2003                   1.9      9.1
                        Cambodia  2014                   1.9      9.1
                         Eritrea  2002                   9.9      8.9
Democratic Republic of the Congo  2013                   9.9      8.9
        Central African Republic  2004                   9.9      8.9
                         Senegal  2013                   9.9      8.9
Lao People's Democratic Republic  2006                   9.9      8.9
In [138]:
for country in spikes["Country"].head(5).unique():
    s = df_sorted[df_sorted["Country"]==country][["Year",col]].sort_values("Year")
    plt.figure(figsize=(9,3))
    plt.plot(s["Year"], s[col], marker="o")
    plt.title(f"{col} over time (spike check): {country}")
    plt.xlabel("Year"); plt.ylabel(col)
    plt.grid(True, alpha=0.3)
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

“Thinness is measured per country-year; values are generally smooth but can have regime shifts likely due to data source/measurement changes.”

“We did not modify observed values; we only imputed missing values using country-wise interpolation, preserving each country’s trajectory.”

In [139]:
df2 = dataframe.sort_values(["Country","Year"]).copy()
df2[col] = pd.to_numeric(df2[col], errors="coerce")

df2.loc[(df2[col] < 0) | (df2[col] > 50), col] = np.nan

df2[col] = df2.groupby("Country")[col].transform(lambda s: s.interpolate(limit_direction="both"))
df2[col] = df2.groupby("Country")[col].transform(lambda s: s.fillna(s.median()))
df2[col] = df2[col].fillna(df2[col].median())

print("Missing % after thinness impute:", df2[col].isna().mean()*100)

dataframe = df2
Missing % after thinness impute: 0.0
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\numpy\lib\nanfunctions.py:1215: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
In [140]:
dataframe = dataframe.drop(columns=["Adult Mortality"], errors="ignore")
In [141]:
y = "Life expectancy"
dataframe[y] = pd.to_numeric(dataframe[y], errors="coerce")

missing_le = dataframe[dataframe[y].isna()]
print("Missing Life expectancy rows:", len(missing_le))
Missing Life expectancy rows: 8
In [142]:
countries = missing_le["Country"].value_counts()

print("Countries with missing Life expectancy:", len(countries))
print(countries.head(30))      # top 30 countries by missing count
print("\nAll affected countries:")
print(countries.index.tolist())
Countries with missing Life expectancy: 8
Country
Dominica                 1
Marshall Islands         1
Monaco                   1
Nauru                    1
Palau                    1
Saint Kitts and Nevis    1
San Marino               1
Tuvalu                   1
Name: count, dtype: int64

All affected countries:
['Dominica', 'Marshall Islands', 'Monaco', 'Nauru', 'Palau', 'Saint Kitts and Nevis', 'San Marino', 'Tuvalu']
In [201]:
dataframe = dataframe[dataframe["Life expectancy"].notna()].copy()
In [189]:
dataframe = dataframe.drop("Country_wb",axis = 1)

Feature Engineering¶

Feature Engineering je metoda koju primenjujemo nad podacima posmatranog skupa podataka. Ideja je da se kroz kombinovanje, transformaciju ili restrukturiranje postojećih promenljivih izvuče dodatna informacija koja nije eksplicitno sadržana u originalnim podacima. Formiranjem novih promenljivih omogućavamo modelu da lakše prepozna obrasce i odnose u podacima, čime se može poboljšati prediktivna moć modela.

Pre nego što započnemo preformulisanje naših promenljivih, sagledajmo koje promenljive dataseta smo ostavili (Prethodno smo izbacili "BMI" zbog velikih nelogičnosti, kao i "thinness 1-9 years" usled idetnične korelacije i raspodele sa drugom thinness promenljivom)

In [190]:
dataframe.head()
Out[190]:
Country Year Life expectancy infant deaths Alcohol percentage expenditure Hepatitis B Measles under-five deaths Polio ... Diphtheria HIV/AIDS GDP Population thinness 10-19 years Income composition of resources Schooling Status_Developing immunization_index log_thinness 10-19 years
15 Afghanistan 2000 54.8 88 0.02 10.424960 62.0 6532 122 24.0 ... 24.0 0.1 114.560000 29375600.0 2.3 0.338 5.5 1 36.666667 1.193922
14 Afghanistan 2001 55.3 88 0.02 10.574728 63.0 8762 122 35.0 ... 33.0 0.1 117.496980 29664630.0 2.1 0.340 5.9 1 43.666667 1.131402
13 Afghanistan 2002 56.2 88 0.02 16.887351 64.0 2486 122 36.0 ... 36.0 0.1 187.845950 21979923.0 19.9 0.341 6.2 1 45.333333 3.039749
12 Afghanistan 2003 56.7 87 0.02 11.089053 65.0 798 122 41.0 ... 41.0 0.1 198.728544 23648510.0 19.7 0.373 6.5 1 49.000000 3.030134
11 Afghanistan 2004 57.0 87 0.02 15.296066 67.0 466 120 5.0 ... 5.0 0.1 219.141353 24118979.0 19.5 0.381 6.8 1 25.666667 3.020425

5 rows × 21 columns

Najpre, pošto smo sagledali ogromnu značajnost promenljive Status gde nam je što empirijski poznato što zbog domenskog znanja da stanovnici razvijene Zemlje imaju duži životni vek, te ćemo ovu promenljivu kodirati u True i False labele.

In [161]:
dataframe = pd.get_dummies(
    dataframe,
    columns=["Status"],
    drop_first=True
)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[161], line 1
----> 1 dataframe = pd.get_dummies(
      2     dataframe,
      3     columns=["Status"],
      4     drop_first=True
      5 )

File c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\reshape\encoding.py:170, in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
    168     raise TypeError("Input must be a list-like for parameter `columns`")
    169 else:
--> 170     data_to_encode = data[columns]
    172 # validate prefixes and separator to avoid silently dropping cols
    173 def check_len(item, name: str):

File c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\frame.py:4119, in DataFrame.__getitem__(self, key)
   4117     if is_iterator(key):
   4118         key = list(key)
-> 4119     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   4121 # take() does not accept boolean indexers
   4122 if getattr(indexer, "dtype", None) == bool:

File c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\indexes\base.py:6212, in Index._get_indexer_strict(self, key, axis_name)
   6209 else:
   6210     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6212 self._raise_if_missing(keyarr, indexer, axis_name)
   6214 keyarr = self.take(indexer)
   6215 if isinstance(key, Index):
   6216     # GH 42790 - Preserve name from an Index

File c:\Users\lukam\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\indexes\base.py:6261, in Index._raise_if_missing(self, key, indexer, axis_name)
   6259 if nmissing:
   6260     if nmissing == len(indexer):
-> 6261         raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6263     not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6264     raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index(['Status'], dtype='object')] are in the [columns]"
In [202]:
dataframe["Status_Developing"] = dataframe["Status_Developing"].astype(int)

Zatim možemo posmatrati pokrivenost imunizacije po državama, pošto promenljive "Diphteria", "Polio" i "Hepatitis B" sve predstavljaju procentualnu imunizaciju gradjana, možemo uzeti prosek ovih promenljivih i tako ih posmatrati na nivou države.

In [163]:
dataframe["immunization_index"] = (
    dataframe["Hepatitis B"] +
    dataframe["Polio"] +
    dataframe["Diphtheria"]
) / 3

Takodje iz ranije prikazanih grafova, činio se kao očigledno dobar izbor da se više promenljivih predstave preko logaritamskih transformacija. Promenljive koje su se činile kao veoma dobar izbor za ovo, iz razloga što je njihova raspodela bila jako desno asimetrična (većina podataka se grupisalo u klaster sa leve strane), su "GDP", "infant deaths", "HIV/AIDS".

In [164]:
features_log = ["GDP","HIV/AIDS","infant deaths"]

for feature in features_log:
    dataframe[f"log_{col}"] = np.log1p(dataframe[col])
In [165]:
dataframe.head()
Out[165]:
Country Year Life expectancy infant deaths Alcohol percentage expenditure Hepatitis B Measles under-five deaths Polio ... HIV/AIDS GDP Population thinness 10-19 years Income composition of resources Schooling Country_wb Status_Developing immunization_index log_thinness 10-19 years
15 Afghanistan 2000 54.8 88 0.02 10.424960 62.0 6532 122 24.0 ... 0.1 114.560000 29375600.0 2.3 0.338 5.5 Afghanistan 1 36.666667 1.193922
14 Afghanistan 2001 55.3 88 0.02 10.574728 63.0 8762 122 35.0 ... 0.1 117.496980 29664630.0 2.1 0.340 5.9 Afghanistan 1 43.666667 1.131402
13 Afghanistan 2002 56.2 88 0.02 16.887351 64.0 2486 122 36.0 ... 0.1 187.845950 21979923.0 19.9 0.341 6.2 Afghanistan 1 45.333333 3.039749
12 Afghanistan 2003 56.7 87 0.02 11.089053 65.0 798 122 41.0 ... 0.1 198.728544 23648510.0 19.7 0.373 6.5 Afghanistan 1 49.000000 3.030134
11 Afghanistan 2004 57.0 87 0.02 15.296066 67.0 466 120 5.0 ... 0.1 219.141353 24118979.0 19.5 0.381 6.8 Afghanistan 1 25.666667 3.020425

5 rows × 22 columns

Data preprocessing¶

Pre nego što se upustimo u feature selection i odabir najboljih promenljivih za naš model, želimo da pretprocesiramo podatke tako da naš model što efikasnije barata sa njima, a ujedno možemo i da raspodelimo podatke na tri skupa:

  • Trening skup ovo je skup podataka koji model koristi pri treniranju, odnosno skup podataka za koji model pravi predikciju, računa grešku i koriguje se tako što promeni parametre koje koristi pri predikciji.

  • Validacioni skup ovo je skup podataka koje model koristi da nakon treniranja sagleda koliko je naučio, pravi predikcije i računa metrije nad validacionim skupom kako bi sagledali kako se model ponaša kada vidi nove podatke, ujedno sagledamo razlike metrike nad validacionim i trening skupom kako bi uočili da li postoji overfit.

  • Test skup ovo je skup podataka koje model vidi kada u potpunosti završi sa treniranjem, to su podaci koje model nikada ranije nije video i služe kao pravo merilo uspešnosti modela.

Pre nego što podelimo naše podatke na ova tri skupa, neophodno je da razdvojimo ciljanu promenljivu "Life expectancy" od trening skupa kako model ne bi imao uvod u ono šta predvidja, ujedno moramo i skalirati podatke koje pripadaju trening skupu. Ovo radimo pomoću StandardScaler bibloleteke koja na osnovu srednje vrednosti i standardne devijacije skalira podatke čime dobijamo da svi podaci budu na jednoj istoj skali i smanjujemo dominaciju outliera.

In [203]:
X = dataframe.drop(["Life expectancy","Country"], axis = 1)
y = dataframe["Life expectancy"]
In [210]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
X_scaled["Status_Developing"] = X["Status_Developing"]
In [205]:
X_scaled.head()
Out[205]:
Year infant deaths Alcohol percentage expenditure Hepatitis B Measles under-five deaths Polio Total expenditure Diphtheria HIV/AIDS GDP Population thinness 10-19 years Income composition of resources Schooling Status_Developing immunization_index log_thinness 10-19 years
0 -1.626978 0.487289 -1.259174 -0.367848 -0.750392 0.358130 0.496818 -2.512290 0.909136 -2.471073 -0.324177 -0.593821 0.016338 -0.571880 -2.012005 -2.134521 1.0 -2.230649 -0.447598
1 -1.410048 0.487289 -1.259174 -0.367773 -0.709743 0.551926 0.496818 -2.040647 0.748901 -2.090006 -0.324177 -0.593633 0.019053 -0.617262 -1.999361 -2.005207 1.0 -1.884723 -0.536001
2 -1.193118 0.487289 -1.259174 -0.364609 -0.669094 0.006517 0.496818 -1.997770 0.732877 -1.962984 -0.324177 -0.589120 -0.053120 3.421734 -1.993039 -1.908222 1.0 -1.802359 2.162364
3 -0.976187 0.478843 -1.259174 -0.367515 -0.628445 -0.140177 0.496818 -1.783387 1.157501 -1.751280 -0.324177 -0.588421 -0.037449 3.376352 -1.790731 -1.811236 1.0 -1.621160 2.148768
4 -0.759257 0.478843 -1.259174 -0.365406 -0.547146 -0.169029 0.484402 -3.326947 1.145484 -3.275546 -0.324177 -0.587112 -0.033031 3.330970 -1.740154 -1.714251 1.0 -2.774246 2.135040

Prikazom ovih podataka, vidimo da su svi podaci uspešno skalirani normalizacijom. Sada možemo podeliti podatke na train-validation-test skupove.

In [212]:
X_train, X_temp, y_train, y_temp = train_test_split(X_scaled, y, test_size=0.3,random_state=42,stratify=X_scaled["Status_Developing"])

X_val,X_test,y_val,y_test = train_test_split(X_temp,y_temp,test_size = 0.5,random_state = 42,stratify=X_temp["Status_Developing"])

Train test split smo odradili sa opcijom stratify iza kojeg je ideja da raspodelimo ova dva skupa podataka tako da imaju jednaku proporciju primera gde je status 1/0 odnosno Developing/Developed, takodje uzimamo da veličina testnog skupa bude 20% ukupne veličine skupa podataka.

Feature Selection i procena modela¶

Feature Selection predstavlja suštinu celokupnog procesa obavljenog nad skupom podataka. Ovom metodom biramo promenljive koje će model koristiti za predikcije, promenljive biramo na osnovu svih zaključaka koje smo dobili kroz sve prethodne metode, gde je cilj da zadržimo samo releveantne karakteristike uz pomoć kojih model dobija informacije, a da se irelevantne ili visoko korelisane promenljive odbace.

Matrica korelacije¶

In [214]:
plt.figure(figsize=(14, 10))
numeric_data = dataframe.select_dtypes(include=['float64', 'int64'])
correlation_matrix = numeric_data.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
No description has been provided for this image

Sagledavši matricu korelacije i na osnovu zaključaka sa prethodnih grafova za sledeće promenljive se odlučujemo da se sigurno neće razmatrati u feature selection-u:

  • percentage expenditure: pošto je u jakoj korelaciji sa GDP-om, što ukazuje na multikolinearnost, dodatno na osnovu domenskog znanja očekujemo da države koje imaju velik GDP imaju i veći life expectancy zato što gradjani žive komfornije živote i pre svega zato što imaju bolji zdravstveni sistem.

  • under-five deaths: vidimo korelaciju 1, što ukazuje na sigurnu multikolinearnost i znamo da obe promenljive opisuju veoma sličnu stvar (najveći broj preminule dece ispod 5 godina pripada starosnoj dobi novorodjenčadi) koja bi dovela do haosa sa težinama modela.

  • Popultaion: na osnovu grafika i matrice korelacije (Life Expectancy ~ Population = -0.03) je veoma jasno da korelacija praktično ne postoji.

  • Income composition of resources: Gotovo sigurna multikolinearnost sa Schooling, u ovom slučaju možemo izabrati bilo koju od ove dve promenljive, ali biramo Schooling pošto je lakša za interpretaciju.

  • Country: Pošto ima isuviše jedinstvenih vrednosti ne možemo koristiti ovu promenljivu, primenom one-hot encodinga bi dobili previše novih kolona i zakomplikovali model.

  • Year: Ne postoji dovoljan broj godina da uhvatimo očigledan trend za države na osnovu godina.

Forward-selection¶

Sada ćemo primeniti metodu forward-selection. Ideja iza ove metode je da napravimo početni model koji sadrži samo jednu promenljivu za koju smatramo da može objasniti najveći udeo varijabilnosti promenljive "Life expectancy", zatim da propratimo predikcije koje dobijamo takvim modelom i uporediti metrikama RSE MAE i R^2. Promenljive koje koristimo kao prediktore modela će biti isključivo iz EDA koje smo smatrali da dobro opisuju ciljanu promenljivu i Feature Engineering odeljka.

Promenljiva koju prvo biramo za naš model će biti GDP, pošto se sa grafika videla jasna korelacija GDP-a države i očekivanog životnog veka.

In [217]:
model = LinearRegression()

features =[
    "GDP",
    "Schooling",
    "infant deaths",
    "thinness 10-19 years",
    "Status_Developing",
    "Alcohol",
    "immunization_index",
    "HIV/AIDS"
]

target = "Life expectancy"

Pošto forward-selection predstavlja iterativni proces, napravićemo listu promenljivih koje želimo da uključimo u forward selection, i posmatrati poboljšanja pri dodavanju svake od ovih promenljivih. Posmatrane promenljive su

  • GDP: Korelacija srednje jacine sa Life Expectancy, na osnovu domenskog znanja je veoma logično izabrati ovu promenljivu pošto iziskuje da države sa visokim GDPom imaju visoke životne standarde ali pre svega je smisleno pretpostaviti i da ulažu dosta novca u zdravstveni sistem.

  • Schooling: Jaka korelacija sa Life Expectancy, takodje na osnovu grafika je bilo veoma prominentno da razvijene države imaju visok nivo edukacije što iziskuje i visoko očekivanje životnog veka te populacije. Ovime se takodje naznačava da viši nivo edukacije pored toga što doprinosi više opcija pojedinicu, doprinosi i da pojedinac čuje više različitih mišljenja ali i da ima veću svest o bitnosti redovnih i sistematskih pregleda, kao i svest o tome šta bi prvi siptomi koje iskusi mogli da naznače.

  • infant deaths: Iako nema jaku korelaciju, smisleno je odabrati ovu promenljivu pošto najčešće ukazuje na probleme sa zdravstvenim sistemom, pregledima i brizi o novorodjenčadima kao i moguće prisustvo odredjenih bolesti ili epidemija.

  • thinness 10-19 years: Ova promenljiva uparena sa Status promenljivom iziskivala je da razvijene države imaju nizak broj mršavosti adolescenata/tinejdžera, što ukazuje na prikladnu ishranu, dostupnost hrane i najčešće u slučaju ovih država svest o bitnosti pravilne i normalne ishrane.

  • Status_Developing: Kategorijska promenljiva za koju empirijski znamo (kroz niz grafika) a i na osnovu domenskog znanja da ima jak uticaj na ciljanu promenljivu. Možemo napraviti sličan komentar kao za GDP, da razvijene zemlje ujedno imaju i razvijeno zdravstvo, pobudjenu svest o zdravom životu i slično.

  • Alcohol: Iako je na graficima bilo prisutno vidjenje da razvijene države imaju visoku konzumpciju alkohola, smisleno je da te države ujedno i dobro balansiraju ovaj faktor uz pomoć jakog zdravstvenog sistema i vidimo da možda propagiraju norme bezbednijeg konzumiranja alkohola. Ujedno je prisutna ideja i da u razvijenim državama ljudi piju alkohol češće ali u manjim količinama, posebno zato što se propagira da čaša vina uveče posle posla može doneti i zdravstvene benefite.

  • immunization_index: Promenljiva koja opisuje koliko jedna država ima jaku imunizaciju, potkrepljena je nivoom svesti o zdravstvu naručito pošto postoje osobe koje su ubedjene da vakcinacija ne doprinosi ničemu već da služi kako bi državni organi menjali RNK ljudi, ubacivali nano čipove i ostale apsurdnosti. Sa druge strane spektruma, može oslikati siromaštvo zemalja, naječešće kod u potpunosti nerazvijenih zemalja (nažalost najčešće na afričkom kontinentu) koji jedva da imaju protokole vakcinacije i veoma retke sistematske preglede.

  • HIV/AIDS: Promenljiva ima solidnu korelaciju sa Life Expectancy promenljivom, na osnovu domenskog znanja je ponovo jako smisleno odabrati promenljivu pošto države sa niskim brojem prijavljenih slučaja HIV-a ukazuju na normalan nivo socijalne svesti i ponašanja pojedinaca koji su u direktnoj korelaciji sa životnim vekom.

In [222]:
selected_features = []

print("FORWARD SELECTION REZULTATI\n")

def adjR2(r2,n,p):
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return adj_r2

for feature in features:
    
    selected_features.append(feature)
    
    model.fit(X_train[selected_features], y_train)
    
    y_train_pred = model.predict(X_train[selected_features])
    y_val_pred = model.predict(X_val[selected_features])
    
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
    
    train_r2 = r2_score(y_train, y_train_pred)
    val_r2 = r2_score(y_val, y_val_pred)

    n = X_val[selected_features].shape[0]
    p = X_val[selected_features].shape[1]

    adj_r2_val = adjR2(val_r2,n,p)
    
    print("Features:", selected_features)
    print("Train RMSE:", round(train_rmse, 3))
    print("Val RMSE:", round(val_rmse, 3))
    print("Train R2:", round(train_r2, 3))
    print("Val R2:", round(val_r2, 3))
    print("Adjusted Val R2:", round(adj_r2_val, 3))
    print("-" * 40)
FORWARD SELECTION REZULTATI

Features: ['GDP']
Train RMSE: 7.951
Val RMSE: 7.862
Train R2: 0.302
Val R2: 0.306
Adjusted Val R2: 0.305
----------------------------------------
Features: ['GDP', 'Schooling']
Train RMSE: 6.065
Val RMSE: 5.717
Train R2: 0.594
Val R2: 0.633
Adjusted Val R2: 0.631
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths']
Train RMSE: 6.056
Val RMSE: 5.707
Train R2: 0.595
Val R2: 0.634
Adjusted Val R2: 0.632
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths', 'thinness 10-19 years']
Train RMSE: 5.968
Val RMSE: 5.726
Train R2: 0.607
Val R2: 0.632
Adjusted Val R2: 0.629
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths', 'thinness 10-19 years', 'Status_Developing']
Train RMSE: 5.96
Val RMSE: 5.717
Train R2: 0.608
Val R2: 0.633
Adjusted Val R2: 0.629
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths', 'thinness 10-19 years', 'Status_Developing', 'Alcohol']
Train RMSE: 5.893
Val RMSE: 5.574
Train R2: 0.616
Val R2: 0.651
Adjusted Val R2: 0.646
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths', 'thinness 10-19 years', 'Status_Developing', 'Alcohol', 'immunization_index']
Train RMSE: 5.705
Val RMSE: 5.313
Train R2: 0.641
Val R2: 0.683
Adjusted Val R2: 0.678
----------------------------------------
Features: ['GDP', 'Schooling', 'infant deaths', 'thinness 10-19 years', 'Status_Developing', 'Alcohol', 'immunization_index', 'HIV/AIDS']
Train RMSE: 4.589
Val RMSE: 4.346
Train R2: 0.767
Val R2: 0.788
Adjusted Val R2: 0.784
----------------------------------------

(Napomena, da biste videli sve rezultati, pogledajte poslednji output kao scrollable element) Posmatranjem forward selectiona i svih iteracija, dolazimo do zaključka da promenljive: infant deaths, thinness 10-19 years i Status Developing skoro uopšte ne poboljšavaju metrike modela, to jest, jasno se vidi da RMSE ne opada i da Adjusted R^2 ne raste, odakle dolazimo da zaključka da sa ovim promenljivima model stagnira. Zadržavamo sve ostale promenljive i sagledavamo metrike nad njima.

In [226]:
final_features =[
    'GDP',
    'Schooling',
    'Alcohol',
    'immunization_index',
    'HIV/AIDS',
]

model.fit(X_train[final_features], y_train)

y_train_pred = model.predict(X_train[final_features])
y_val_pred = model.predict(X_val[final_features])

train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

n = X_val[final_features].shape[0]
p = X_val[final_features].shape[1]

adj_r2_val = adjR2(val_r2,n,p)

print("Features:", final_features)
print("Train RMSE:", round(train_rmse, 3))
print("Val RMSE:", round(val_rmse, 3))
print("Train R2:", round(train_r2, 3))
print("Val R2:", round(val_r2, 3))
print("Adjusted Val R2:", round(adj_r2_val, 3))
print("-" * 40)
Features: ['GDP', 'Schooling', 'Alcohol', 'immunization_index', 'HIV/AIDS']
Train RMSE: 4.66
Val RMSE: 4.41
Train R2: 0.76
Val R2: 0.782
Adjusted Val R2: 0.779
----------------------------------------

Izbacivanjem ovih promenljivih i poredjenjem metrika prvobitnog modela vidimo da je prvobitni model minimalno bolji u predikciji (za adj R2 razlika je 0.005), taj boljitak je statistički neznačajan pa ćemo zadržati model sa manje promenljivih čime osiguravamo da smo zadržali samo ključne promenljive. Ovom oznakom metrike R2 = 0.779 što nam govori da naš model objašnjava 78% varijabilnosti promenljive "Life Expectancy" nad validacionim skupom, ujedno vidimo da je i RMSE nad validacionim skupom 4.41 što je solidan rezultat. Ostalo nam je još da proverimo da li postoji multikolinearnost izmedju datih promenljivih, uporedjivanjem ovih promenljivih se na oko čini da to ne bi trebao da bude slučaj, ali nam je potrebno da to potporimo računom, s toga ćemo izračunati VIF ovog modela.

VIF (Variance Inflation Factor) je metrika koja nam naznačava koliko su težine modela povećane zbog multikolinearnosti medju nezavisnim promenljivima, generalno rečeno vrednost za VIF koja je manja od 5 se smatra da ne postoji jaka korelisanost izmedju promenljivih posmatranog modela.

In [ ]:
X = dataframe[final_features]
X = sm.add_constant(X)

vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)
              Feature        VIF
0               const  29.179061
1                 GDP   1.447015
2           Schooling   2.056805
3             Alcohol   1.533496
4  immunization_index   1.244778
5            HIV/AIDS   1.078548

Sada kada sagledamo VIF vrednosti za sve nezavisne promenljive koje smo uključili u naš model, možemo videti da ne postoji jaka korelisanost izmedju promenljivih. Nakon poredjenja VIF metrike, smatramo da naš model zadovoljavajuće generalizuje problem predvidjanja očekivanog životnog veka, s toga napokon možemo sagledati njegovo ponašanje nad testnim skupom.

In [238]:
y_test_pred = model.predict(X_test[final_features])

test_rmse = np.sqrt(mean_squared_error(y_test,y_test_pred))
test_mae = mean_absolute_error(y_test,y_test_pred)
test_r2 = r2_score(y_test,y_test_pred)

print("TEST RMSE :",test_rmse)
print("TEST MAE : ",test_mae)
print("TEST R2",test_r2)
TEST RMSE : 4.144140131371196
TEST MAE :  3.247733999905864
TEST R2 0.8018094752106939

Nad testnim skupom dobijamo očekivanu vrednost za R2 metriku koja je približno jednaka onoj sa validacionog skupa što dodatno naznačava da ne postoji overfit u bilo kakvom smislu, zaključak je isti odnosno model uspeva da objasni 80% varijabilnosti ciljne promenljive "Life Expectancy". Odnosno objašnjava 80% podataka koji odstupaju od proseka promenljive "Life Exepctancy" dok ostalih 20% potiču od faktora koje možda nismo uvedeli ali je vrlo verovatnije da su nastali od velikog šuma koji je bio veoma prisutan u skupu podataka. S druge strane MAE nam ukazuje na to da u proseku naš model u proseku greši za ±3.2 godine u svojim predvidjanjima, što je na nivou države sasvim solidno predvidjanje jednog ovoliko prostog modela.

Implementacija ostalih modela i poredjenje¶

Glavno pitanje koje je postavljeno pri izradi ovog seminarskog rada je sledeće: "Kako da pomoću socio-ekonomskih faktora predstavimo model koji može da prediktuje životni vek države?" Probali smo da odgovorimo na ovo pitanje kreiranjem modela linearne regresije koji prediktuje kontinuirani tip podataka promenljive "Life Expectancy", pošto je ovaj problem regresioni možemo primeniti i modele poput Ridge/Lasso Regression, Random Forest, XGBoost, itd. Na ovaj način možemo direktno uporediti naš model sa osatlim modelima i doći do novih zapažanja i odnosa koje možda nismo uvideli.

Ridge i Lasso Regularizacija¶

Prve modele koje ćemo posmatrati koriste metode regularizacije linearne regresije Ridge i Lasso Regression. Ideja modela je da osnovnu linearnu regresiju modifikuju uključujući kazneni parametar alfa(lambda) koji smanjuje vrednost (ili anulira u potpunosti) težinskih koeficijenata koji stoje uz odgovarajuće prediktore, sve zarad veće moći generalizacije. Ova dva metoda se razlikuju po tome što Ridge smanjuje pojedine koeficijente toliko da postaju približno jedanki nuli, dok Lasso postavlja vrednost koeficijenata na nula i tako ih u potpunosti uklanja iz jednačine.

Pre nego što napravimo Ridge i Lasso modele odredićemo najbolje vrednosti za paramtear alfa koristeći Cross Validaciju.

In [253]:
alphas = [0.001, 0.01, 0.1, 1, 10, 100]

ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train, y_train)

best_ridge_alpha = ridge_cv.alpha_

print("Best alpha Ridge:", best_ridge_alpha)

lasso_cv = LassoCV(alphas=alphas, max_iter=10000)
lasso_cv.fit(X_train, y_train)

best_lasso_alpha = lasso_cv.alpha_

print("Best alpha Lasso:", best_lasso_alpha)
Best alpha Ridge: 1.0
Best alpha Lasso: 0.001

Sada za ovako dobijene najbolje parametre alpha treniramo naše modele i dobijamo R2,RMSE i MAE

In [269]:
ridge = Ridge(alpha=best_ridge_alpha, max_iter=10000)   
ridge.fit(X_train, y_train)

y_pred_ridge = ridge.predict(X_test)

print("Ridge RMSE:",np.sqrt(mean_squared_error(y_test, y_pred_ridge)))
print("Ridge MAE:", mean_absolute_error(y_test, y_pred_ridge))
Ridge RMSE: 3.3521749529932077
Ridge MAE: 2.468885417275904
In [270]:
lasso = Lasso(alpha=best_lasso_alpha, max_iter=10000)
lasso.fit(X_train, y_train)

y_pred_lasso = lasso.predict(X_test)

print("Lasso RMSE:",np.sqrt(mean_squared_error(y_test, y_pred_lasso)))
print("Lasso MAE:", mean_absolute_error(y_test, y_pred_lasso))
Lasso RMSE: 3.3561175428908134
Lasso MAE: 2.472267776463046

Sledeća dva modela koja ćemo posmatrati su Random Forest i XGBoost.

Random Forest predstavlja skup metoda koja kombinuje veliki broj stabala odlučivanja (decision trees), svako stablo se trenira na nasumičnom podskupu podataka i podskupu promenljivih, čime se smanjuje varijansa modela. Konačna predikcija dobija se prosekom (kod regresije) ili glasanjem (kod klasifikacije), model jako dobro zaobilazi problem overfitting-a i dobro funkcioniše i kada postoji nelinearna zavisnost između promenljivih.

XGBoost (Extreme Gradient Boosting) je optimizovana implementacija gradient boosting algoritma,on je takodje implementacija stabla odlučivanja, ideja je da model gradimo sekvencijalno, tako što svako novo stablo pokušava da ispravi greške prethodnih stabala. Koristi regularizaciju (L1 i L2) kako bi se smanjila kompleksnost modela i sprečio overfitting što je generalna odlika Decision tree algoritama, zbog visoke efikasnosti i performansi, XGBoost se često koristi u takmičarskim i realnim problemima u oblasti mašinskog učenja.

In [271]:
rf = RandomForestRegressor(n_estimators=200,max_depth=None,random_state=42)

rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

print("Random forest RMSE:",np.sqrt(mean_squared_error(y_test, y_pred_rf)))
print("Random forest MAE:", mean_absolute_error(y_test, y_pred_rf))
Random forest RMSE: 1.6068533257307265
Random forest MAE: 1.1079782608695632
In [274]:
xgb = XGBRegressor(n_estimators=200,learning_rate=0.05,max_depth=4,subsample=0.8,colsample_bytree=0.8,random_state=42)

xgb.fit(X_train, y_train)

y_pred_xgb = xgb.predict(X_test)

print("XGBoost RMSE:",np.sqrt(mean_squared_error(y_test, y_pred_xgb)))
print("XGBoost MAE:", mean_absolute_error(y_test, y_pred_xgb))
XGBoost RMSE: 1.7944843679286826
XGBoost MAE: 1.324241596283029

Poredjenje modela¶

Nakon što smo formirali i iztrenirali sve modele sada možemo da uporedimo sve metrike koje smo dobili i uvideti kako se naš model linearne regresije poredi sa ostalima

Model RMSE MAE
Linear Regression 4.14 3.24
Ridge 3.35 2.46
Lasso 3.35 2.47
Random Forest 1.60 1.10
XGBoost 1.79 1.32

Prikazom tabele, možemo prvo uporediti naš linearni model sa dva modela koja koriste regularizaciju (Ridge i Lasso), poredjenjem uvidimo da naš model greši za ≈ 0.8 godina za MAE što je sasvim solidan rezultati ako imamo u vidu da ovi metodi regularizacije traže što optimalniji model tako što isključuju odredjene prediktore preko koeficijenta i time traže što optimalniji model. Sa druge strane ako posmatramo Random Forest i XGBoost modele, u njihovom slučaju vidimo značajne dobtike, gde su MAE i RMSE oba modela jako niski pokazujući veliku uspešnost predikcije Life Expectancy promenljive. Ovo ukazuje na ključnu prednost modela zasnovanih na stablima odlučivanja - sposobnost modelovanja nelinearnih odnosa između promenljivih. Random Forest to postiže bagging pristupom i kombinovanjem više stabala, dok XGBoost koristi gradient boosting, gde svako novo stablo sekvencijalno ispravlja greške prethodnih. Iako modeli koji koriste nasumične šume imaju najbolju prediktivnu moć, svi oni imaju svoje prednosti i mane, i njihovi rezultati i primena se menjaju u zavisnosti od problema do problema koji je potrebno da reše, dodatno je bitno naglasiti da se svi modeli drugačije ponašaju i u zavinosti od toga da li su podaci normalizovani, linearni ili nelinearni, koliko resursa imamo na raspolaganju pri rešavanju problema i slično.

Zaključak¶

Ovaj seminarski rad se zasnivao na ideji implementiranja metoda Nauke o podacima i metoda Mašinskog učenja kako bismo kroz sve korake (eksplorativna analiza, čišćenje podataka, priprema podataka, feature engineering, feature selection, izgradnja modela, implementacija i poredjenje modela) pokušali da što optimalnije predvidimo vrednosti očekivanog životnog veka (Life expectancy).

Analizom podataka smo došli do zaključka da promenljive koje su na kraju odabrane za prediktore modela, BDP države, nivo edukacije države (Schooling), konzumiranje alkohola na nivou države, logaritamsku vrednost za zastupnost bolesti HIV/AIDS, kao i promenljiva dobijena "Feature engineeringom" immunization_index.

Takodje je primenom neregularizovane višestruke linearne regresije u odnosu na ciljnu promenljivu Life expectancy dobijen model koji sa ovako limitiranim brojem promenljivih opisuje 80% varijabilnosti očekivanog životnog veka.

Poredjenjem ovog modela sa ostalim modelima, možemo uočiti da modeli koji koriste regularizacione metode (Ridge i Lasso) su kaskali za modelima koji su najbolje uspevali da opišu variajbilnost očekivanog životnog veka i pružali najmanje vrednosti RMSE i MAE metrika.

Projekat bi se mogao unaprediti eksperimentisanjem različitih pristupa u feature engineering fazi, gde bi odredjene promenljive se mogle pretvoriti u kategorijske (HIV/AIDS,Schooling i slične). Za dodatno unapredjenje bi posebno pomoglo kada bi skup podataka sadržao konciznije vrednosti za promenljivu BMI, jedna ideja iza toga bi mogla biti uvezivanje tih vrednosti iz nekog drugog skupa podataka. Takodje bi bilo korisno pronaći outlier vrednosti preko Z-score metrike, dodatno bi mogli primeniti i testirati Support Vector Regression (SVR) model.

Ovaj rad je pokazatelj da pravilnim vodjenjem osnovnih principa Nauke o podacima i primenom metoda koje ona nalaže se mogu konstruisati precizni modeli za predvidanje očekivanog životnog veka jedne populacije. Model koji smo mi projektovali - model linearne regresije, se pokazao kao poprilično interpretabilan i efikasan pristup za ovaj problem koji je uz manji broj prediktora uspeo da objasni 4/5 varijabilnosti očekivanog životnog veka.

Reference¶

  • https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
  • https://www.nrigroupindia.com/e-book/Introduction%20to%20Machine%20Learning%20with%20Python%20(%20PDFDrive.com%20)-min.pdf
  • Microsoft Teams, Uvod u nauku o podacima
  • https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who